The main difficulty arises because each policy trial returns a noisy estimate of the performance measure; as a result, the gradient estimates themselves are noisy. One source of noise is the randomized policy itself, often used for exploration. We use response surface models to predict the effect that this noise has on the observed performance. This allows us to reduce the variance of the gradient estimates, and we derive expressions for the minimal-variance model in a variety of problem settings. Other sources of noise come from the environment and from the agent's actuators. Sensor data, which partially measures the effect of this noise, can be used to explain away the noise-induced perturbations in the observed performance. We show how to incorporate the sensor information into the gradient estimation task, further reducing the variance of the gradient estimates. In addition, we show that useful sensor encodings have two properties: they are uncorrelated with the agent's choice of action, yet correlated with the perturbations in performance. Finally, we demonstrate the effectiveness of our approach by learning controllers for a simulated dart thrower and a quadruped locomotion task.
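
The variance-reduction idea can be illustrated with a minimal sketch (not the paper's exact estimator): a standard score-function (REINFORCE-style) gradient estimate on a toy one-dimensional problem, where a hypothetical sensor reading `s` partially measures the environment noise. Because `s` is independent of the sampled action, subtracting it from the observed performance leaves the gradient estimate unbiased while removing much of the noise. All quantities here (the quadratic performance function, the Gaussian policy, the noise and sensor scales) are illustrative assumptions, not taken from the source.

```python
import random

random.seed(0)
theta, sigma = 0.5, 0.3                    # Gaussian policy: a ~ N(theta, sigma^2)
n = 5000                                   # number of policy trials

plain, baselined = [], []
for _ in range(n):
    a = random.gauss(theta, sigma)         # sampled (randomized) action
    eps = random.gauss(0.0, 1.0)           # environment noise on this trial
    s = eps + random.gauss(0.0, 0.1)       # sensor partially measures the noise
    J = -(a - 1.0) ** 2 + eps              # noisy performance measurement
    score = (a - theta) / sigma ** 2       # d/dtheta of the policy's log-density
    plain.append(J * score)                # vanilla score-function estimate
    baselined.append((J - s) * score)      # sensor explains away the perturbation

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Both estimators target the same gradient (here 1.0, since the expected
# performance is -(theta - 1)^2 - sigma^2 up to a constant); the sensor-
# baselined estimator is far less noisy because s is independent of a
# but strongly correlated with eps.
print(f"mean: plain={mean(plain):.2f}, baselined={mean(baselined):.2f}")
print(f"var:  plain={var(plain):.2f}, baselined={var(baselined):.2f}")
```

Note the two properties of the sensor highlighted in the abstract: `s` is uncorrelated with the action `a` (so subtracting it introduces no bias) and correlated with the performance perturbation `eps` (so subtracting it reduces variance).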