Description
The key insight of this dissertation is that when a robot is deployed in an environment that humans have been acting in, the state of the environment is already optimized for what humans want, and is thus informative about human preferences.
We formalize this setting by assuming that a human H has been acting in an environment for some time, and a robot R observes the final state that results. From this final state, R must infer as much as possible about H's reward function. We analyze this problem formulation theoretically and show that it is particularly well suited to inferring aspects of the state that should not be changed -- exactly the aspects of the reward that H is likely to forget to specify. We develop a dynamic programming algorithm for tabular environments, analogous to value iteration, and demonstrate its behavior on several simple environments. To scale to high-dimensional environments, we use function approximators judiciously, so that the various components of our algorithm can be trained without enumerating all possible states.
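To illustrate the flavor of this inference, the sketch below (a minimal toy example, not the algorithm developed in the dissertation) uses a hypothetical 3x3 gridworld with a breakable vase, a small discrete set of candidate reward functions, and a Boltzmann-rational model of H. A value-iteration-style backward pass computes H's policy under each candidate reward, a forward pass computes the distribution over final states that policy induces, and Bayes' rule then scores each candidate against the single observed final state. All environment details, hypotheses, and hyperparameters are invented for the example.

    import numpy as np

    # Hypothetical toy environment: a 3x3 grid with a breakable vase in the
    # centre cell.  States are (x, y, vase_intact); H starts at (0, 0) with
    # the vase intact and acts for T steps.
    SIZE, T, VASE = 3, 6, (1, 1)
    states = [(x, y, v) for x in range(SIZE) for y in range(SIZE) for v in (0, 1)]
    idx = {s: i for i, s in enumerate(states)}
    actions = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]  # stay/right/left/up/down

    def step(s, a):
        x, y, v = s
        x2 = min(max(x + a[0], 0), SIZE - 1)
        y2 = min(max(y + a[1], 0), SIZE - 1)
        v2 = 0 if (x2, y2) == VASE else v          # walking onto the vase breaks it
        return (x2, y2, v2)

    def features(s):
        x, y, v = s
        return np.array([float((x, y) == (SIZE - 1, SIZE - 1)),  # at the goal corner
                         float(v)])                              # vase still intact

    # Candidate reward hypotheses: weights on [at goal, vase intact].
    candidates = {"goal only": np.array([1.0, 0.0]),
                  "goal and vase": np.array([1.0, 1.0]),
                  "vase only": np.array([0.0, 1.0])}

    def boltzmann_policy(w, beta=5.0, gamma=0.95, iters=60):
        """Boltzmann-rational policy for reward weights w, computed with a
        value-iteration-style backward pass over the tabular state space."""
        V = np.zeros(len(states))
        for _ in range(iters):
            Q = np.array([[features(step(s, a)) @ w + gamma * V[idx[step(s, a)]]
                           for a in actions] for s in states])
            V = np.log(np.exp(beta * Q).sum(axis=1)) / beta   # soft max over actions
        return np.exp(beta * (Q - V[:, None]))                # P(a | s)

    def p_final_state(policy):
        """Forward pass: distribution over states after T steps from the start."""
        d = np.zeros(len(states))
        d[idx[(0, 0, 1)]] = 1.0
        for _ in range(T):
            d2 = np.zeros_like(d)
            for i, s in enumerate(states):
                for j, a in enumerate(actions):
                    d2[idx[step(s, a)]] += d[i] * policy[i, j]
            d = d2
        return d

    # R observes only the final state: H is at the goal and the vase is intact.
    observed = (SIZE - 1, SIZE - 1, 1)
    lik = {name: p_final_state(boltzmann_policy(w))[idx[observed]]
           for name, w in candidates.items()}
    Z = sum(lik.values())
    for name, l in lik.items():
        print(f"P({name} | final state) ~ {l / Z:.3f}")   # uniform prior over hypotheses

In this toy setting, observing that H reached the goal while leaving the vase intact favors the hypothesis that H cares about both, which illustrates why the observed state is especially informative about aspects of the environment that should not be changed.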
Of course, there is no point in learning about H's reward function unless we use it to guide R's decision-making. While we could have R simply optimize the inferred reward, this suffers from a "status quo bias": the inferred reward is likely to strongly prefer the observed state, since by assumption that state is already optimized for H's preferences. To get R to make changes to the environment, we will usually need to integrate the inferred reward with other sources of preference information. To support such reward combination, we use a model in which R must maximize a reward function that it does not know and that is known only to H. Learning from the state of the world arises as an instrumentally useful behavior in such a setting, and can be used to form a prior belief over the reward function that is then updated through further interaction with H.
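As a minimal sketch of this combination (with made-up numbers, not a model from the dissertation), the distribution produced by the state-of-the-world inference can be treated as a prior over reward hypotheses and updated with a standard Bayesian step when H later provides additional feedback:

    import numpy as np

    hypotheses = ["goal only", "goal and vase", "vase only"]
    prior = np.array([0.20, 0.70, 0.10])      # hypothetical output of the inference sketch above
    # Likelihood of H's later feedback under each hypothesis -- invented numbers
    # standing in for whatever feedback model R uses (comparisons, demonstrations, ...).
    likelihood = np.array([0.6, 0.3, 0.1])
    posterior = prior * likelihood
    posterior /= posterior.sum()
    print(dict(zip(hypotheses, posterior.round(3))))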