Model-free and model-based reinforcement learning have provided a successful framework for understanding both human behavior and neural data. These two systems are usually thought to compete for control of behavior. However, it has also been proposed that they can be integrated in a cooperative manner. For example, the Dyna algorithm uses model-based replay of past experience to train the model-free system, and has inspired research examining whether human learners do something similar. Here we introduce an approach that links model-free and model-based learning in a new way: via the reward function. Given a model of the learning environment, dynamic programming is used to iteratively estimate state values that monotonically converge to the state values under the optimal decision policy. Pseudorewards are calculated from these values and used to shape the reward function of a model-free learner in a way that is guaranteed not to change the optimal policy. In two experiments we show that this method offers computational advantages over Dyna. It also offers a new way to think about integrating model-free and model-based reinforcement learning: that our knowledge of the world doesn't just provide a source of simulated experience for training our instincts, but that it shapes the rewards that those instincts latch onto.




Download Full History