How can we elicit the behaviors we want from artificial agents? One way of guiding the behavior of intelligent systems is through reward design. By specifying reward functions to optimize, we can use reinforcement learning (RL) to enable agents to learn from their own experience and interactions. RL has accordingly seen great success in settings where it is feasible to hand-specify reward functions that are well aligned with the intended behaviors (e.g., using scores as rewards for games). However, as we progress to developing intelligent systems that have to learn more complex behaviors in the rich, diverse real world, reward design becomes both increasingly difficult and increasingly crucial. To address this challenge, we posit that improving reward signals will require new ways of incorporating human input.

This thesis comprises two main parts: reward design that uses human input directly, and reward design that uses it indirectly, through general knowledge we have about people. In the first part, we propose a framework for building robust reward models from direct human feedback. We present a reward modeling formulation that is amenable to large-scale pretrained vision-language models, leading to multimodal reward functions that generalize better under visual and language distribution shifts. In the second part, we use broad knowledge about humans as novel forms of input for reward design. In the human assistance setting, we propose using human empowerment as a task-agnostic reward input. This enables us to train assistive agents that circumvent limitations of existing goal-inference-based methods, while also aiming to preserve human autonomy. Finally, we study the case of eliciting exploratory behaviors in artificial agents. Unlike prior work that indiscriminately optimizes for diversity in order to encourage exploration, we propose leveraging human priors and general world knowledge to design intrinsic reward functions that lead to more human-like exploration. To better understand how the intrinsic objectives guiding human behavior can inform agent design, we also compare how well human and agent behaviors in an open-ended exploration setting align with commonly proposed information-theoretic objectives used as intrinsic rewards. We conclude with some reflections on reward design challenges and directions for future work.



