Description
This thesis studies the interaction of model learning and decision making through two central issues: compounding prediction errors and objective mismatch. The compounding error challenge emerges when errors accumulate over recursive applications of a one-step transition model. Most dynamics models are trained for single-step accuracy, which often yields models with substantial long-term prediction error. Additionally, training a model for accurate transitions does not guarantee high-performing policies on the downstream task. The lack of correlation between model and policy metrics when the two are optimized separately is coined and studied as Objective Mismatch.
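As a concrete illustration of compounding error, the sketch below rolls out an imperfect one-step model recursively and shows the prediction error growing with the horizon even though the single-step error is small. The linear dynamics, the perturbed model, and all numerical values are toy assumptions for illustration, not systems studied in the thesis.

```python
import numpy as np

# Hypothetical linear system and an imperfect "learned" one-step model
# (both matrices are illustrative placeholders).
A_true = np.array([[1.0, 0.1], [0.0, 0.99]])      # true transition matrix
A_hat = A_true + 1e-2 * np.array([[1.0, -0.5],    # learned model with a small
                                  [0.3, 1.0]])    # single-step fitting error

def rollout(A, x0, horizon):
    """Recursively apply a one-step model: x_{t+1} = A x_t."""
    xs = [x0]
    for _ in range(horizon):
        xs.append(A @ xs[-1])
    return np.stack(xs)

x0 = np.array([1.0, 0.0])
true_traj = rollout(A_true, x0, horizon=50)
pred_traj = rollout(A_hat, x0, horizon=50)

# Per-step prediction error grows with the horizon even though A_hat is
# accurate for a single step: this is the compounding error effect.
errors = np.linalg.norm(true_traj - pred_traj, axis=1)
print(errors[1], errors[10], errors[50])
```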
These challenges are studied primarily in the context of sample-based model predictive control (MPC) algorithms, where the learned model is used to simulate candidate trajectories and the rewards they are predicted to accrue. To mitigate compounding error and objective mismatch, this thesis proposes the trajectory-based dynamics model, a feedforward prediction parametrization that includes a direct representation of time. This model represents one small but important step towards more useful dynamics models in model-based reinforcement learning. The thesis concludes with future directions on the synergy of prediction and control in MBRL, focused primarily on state abstractions, temporal correlation, and future prediction methodologies.
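The sketch below is a minimal, hypothetical illustration of these two ideas together: a feedforward model that takes the time index as an explicit input, so any step of the horizon can be predicted without recursive composition, and a sample-based loop that scores candidate controller parameters by the model's predicted rewards. The architecture, the reward, and the way the two pieces are combined are placeholder assumptions, not the exact formulation in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, ctrl_dim, hidden, horizon = 2, 3, 32, 20

# Illustrative two-layer MLP weights; the real model's architecture and
# training procedure are described in the thesis, these are placeholders.
params = (rng.normal(size=(hidden, state_dim + 1 + ctrl_dim)), np.zeros(hidden),
          rng.normal(size=(state_dim, hidden)), np.zeros(state_dim))

def trajectory_model(s0, t, ctrl_params):
    """Map (initial state, time index, controller parameters) directly to the
    predicted state at time t -- no recursive one-step composition."""
    W1, b1, W2, b2 = params
    x = np.concatenate([s0, [float(t)], ctrl_params])  # time is an explicit input
    h = np.tanh(W1 @ x + b1)
    return W2 @ h + b2

def reward(state):
    return -np.sum(state ** 2)  # toy quadratic reward, an assumption

def sample_based_control(s0, n_candidates=256):
    """Score sampled controller parameters by the rewards the model predicts
    for their trajectories, and keep the best candidate."""
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, ctrl_dim))
    returns = [sum(reward(trajectory_model(s0, t, theta)) for t in range(horizon))
               for theta in candidates]
    return candidates[int(np.argmax(returns))]

best_theta = sample_based_control(np.array([1.0, 0.0]))
```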