Reinforcement learning is an increasingly popular framework that enables robots to learn tasks from prior experience in environments where dynamics or shaped reward functions are challenging to model. However, because it requires robots to sample trajectories under significant dynamical uncertainty, a robot may perform unsafe maneuvers during online exploration. This is particularly problematic in real-world robotics, where unsafe behaviors can damage the robot's surroundings. As a result, many impressive reinforcement learning results have been demonstrated only in simulation. Safe reinforcement learning is a field with a rich history that studies how to reduce the number and magnitude of unsafe behaviors during learning, particularly in the real world. It is challenging because the robot must limit exploration enough to remain safe while still exploring enough to maximize the task reward function. Algorithms frequently draw inspiration from methods in control theory, constrained optimization, and online learning to adaptively balance task-driven exploration and safety based on prior experience.

This thesis presents a set of novel safe reinforcement learning algorithms that maintain subsets of the state space where safety is highly probable under the current policy. The algorithms leverage these safe sets in different ways to promote safety during online exploration in the real world. The first part of the thesis covers a class of algorithms in which the robot maintains a conservative safe set of states from which it has already completed the task. As long as the robot approximately maintains the ability to return to the safe set, it can explore outside the safe set and iteratively expand it. This part also presents strong theoretical guarantees for this class of algorithms under known but stochastic nonlinear dynamics. The second part presents another class of algorithms that maintains a much larger safe set based on the probability of the robot committing unsafe behaviors. The robot uses the boundary of this set to determine whether it should focus on task-driven exploration or on safety recovery maneuvers. The final part covers an algorithm that uses policy uncertainty to implicitly model safety and to request human interventions for corrective feedback. The thesis concludes with a commentary on lessons learned and directions for future work.



