Learning optimal motor control poses many challenges, including noisy environments and sensors, nonlinear dynamics, continuous variables, high-dimensional problem domains, and redundancy. Reinforcement learning can be used, in principle, to find optimal controllers; however, traditional learning algorithms are often too slow because obtaining training data is expensive. Although policy gradient methods have shown some promising results, they are limited by the rate at which they can accurately estimate the gradient of the objective function with respect to a given policy's parameters. These algorithms typically estimate the gradient from a number of policy trials. In the noisy setting, however, many policy trials may be necessary to achieve a desired level of performance. This dissertation presents techniques that may be used to minimize the total number of trials required.

The main difficulty arises because each policy trial returns a noisy estimate of the performance measure. As a result, the gradient estimates are noisy as well. One source of noise is the use of randomized policies (often used for exploration purposes). We use response surface models to predict the effect that this noise has on the observed performance. This allows us to reduce the variance of the gradient estimates, and we derive expressions for the minimal-variance model for a variety of problem settings. Other sources of noise come from the environment and from the agent's actuators. Sensor data, which partially measures the effect of this noise, can be used to explain away the noise-induced perturbations in the expected performance. We show how to incorporate the sensor information into the gradient estimation task, further reducing the variance of the gradient estimates. In addition, we show that useful sensor encodings have the following properties: the sensor data is uncorrelated with the agent's choice of action, and the sensor data is correlated with the perturbations in performance. Finally, we demonstrate the effectiveness of our approach by learning controllers for a simulated dart-throwing task and a quadruped locomotion task.
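
The variance-reduction idea can be illustrated with a toy sketch. This is not the dissertation's method or its minimal-variance model. It assumes a one-dimensional Gaussian exploration policy, a quadratic performance measure, an additive environmental disturbance, and hand-picked baselines. All names and constants (`SIGMA`, `TARGET`, `trial`, `compare`, and so on) are hypothetical. The sketch compares a likelihood-ratio gradient estimate with no baseline, with a constant baseline, and with a baseline that also subtracts a sensor reading that is independent of the action but correlated with the disturbance.

```python
import random
import statistics

# All constants below are illustrative assumptions, not values from the source.
SIGMA = 0.5             # std of the exploration noise in the randomized policy
TARGET = 2.0            # action that maximizes the toy performance measure
DISTURBANCE_STD = 3.0   # std of the environmental disturbance
SENSOR_NOISE_STD = 0.3  # std of the measurement noise on the sensor


def trial(theta, rng):
    """Run one noisy policy trial; return (action, sensor, performance)."""
    action = theta + SIGMA * rng.gauss(0.0, 1.0)        # randomized policy
    disturbance = DISTURBANCE_STD * rng.gauss(0.0, 1.0)  # environment noise
    # The sensor partially measures the disturbance and is independent of the action.
    sensor = disturbance + SENSOR_NOISE_STD * rng.gauss(0.0, 1.0)
    performance = -(action - TARGET) ** 2 + disturbance
    return action, sensor, performance


def gradient_estimate(theta, n_trials, rng, baseline):
    """Likelihood-ratio gradient estimate with a sensor-dependent baseline."""
    total = 0.0
    for _ in range(n_trials):
        action, sensor, performance = trial(theta, rng)
        # Derivative of log N(action; theta, SIGMA^2) with respect to theta.
        score = (action - theta) / SIGMA ** 2
        # The baseline does not depend on the action, so subtracting it
        # leaves the expected value of the estimate unchanged.
        total += (performance - baseline(sensor)) * score
    return total / n_trials


def compare(theta=0.0, n_trials=20, n_repeats=2000, seed=0):
    """Return {name: (mean, variance)} of repeated gradient estimates."""
    rng = random.Random(seed)
    # Expected performance of the toy problem, known here in closed form.
    expected = -(theta - TARGET) ** 2 - SIGMA ** 2
    baselines = {
        "none": lambda sensor: 0.0,
        "constant": lambda sensor: expected,
        "sensor": lambda sensor: expected + sensor,
    }
    results = {}
    for name, baseline in baselines.items():
        estimates = [
            gradient_estimate(theta, n_trials, rng, baseline)
            for _ in range(n_repeats)
        ]
        results[name] = (statistics.mean(estimates), statistics.variance(estimates))
    return results


if __name__ == "__main__":
    for name, (mean, variance) in compare().items():
        print(f"{name:>8}: mean={mean:.3f} variance={variance:.3f}")
```

In this toy problem the true gradient at `theta = 0` is `-2 * (theta - TARGET) = 4`. All three estimators should average close to that value, while their variances differ. The sensor-based baseline helps only because the sensor reading carries information about the performance perturbation without depending on the chosen action, which mirrors the two properties of useful sensor encodings stated above.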



