Reinforcement learning (RL) has demonstrated impressive performance in various domains such as video games, Go, robotic locomotion, and manipulation tasks. As we turn towards RL to power autonomous systems in the physical world, a natural question to ask is, how do we ensure that the behavior observed in the laboratory reflects the behavior that occurs when systems are deployed in the real world? How much data do we need to collect in order to learn how to control a system with a high degree of confidence?

This thesis takes a step towards answering these questions by establishing the Linear Quadratic Regulator (LQR) as a baseline for comparison of RL algorithms. LQR is a fundamental problem in optimal control theory for which the exact solution is efficiently computable with perfect knowledge of the underlying dynamics. This makes LQR well suited as a baseline for studying the sample complexity of RL algorithms which learn how to control from observing repeated interactions with the system.

The first part of this thesis focuses on model-based algorithms which estimate a model of the underlying system, and then build a controller based on the estimated dynamics. We show that the classic certainty equivalence controller, which discards confidence intervals surrounding the estimated dynamics, is efficient in regimes of low uncertainty. For regimes of moderate uncertainty, we propose a new model-based algorithm based on robust optimization, and show that it is also sample efficient.

The second part studies model-free algorithms which learn intermediate representations instead, or directly search for the parameters of the optimal controller. We first look at the classical least-squares policy iteration algorithm, and establish an upper bound on its sample complexity. We then use tools from asymptotic statistics to characterize the asymptotic behavior of both the certainty equivalence controller and the popular policy gradient method on a particular family of LQR instances, which allows us to directly compare the bounds. This comparison reveals that the model-free policy gradient method has polynomial in state/input dimension and horizon length worse sample complexity than the model-based certainty equivalence controller. Our experiments corroborate this finding and show that model-based algorithms are more sample efficient than model-free algorithms for LQR.




Download Full History