Deploying learning systems in the real world requires aligning their objectives with those of the humans they interact with. Existing algorithmic approaches to this alignment try to infer these objectives from human feedback. The correctness of these algorithms depends crucially on several simplifying assumptions about 1) how humans represent these objectives, 2) how humans respond to queries given these objectives, and 3) how well the hypothesis space represents these objectives. In this thesis, we question the robustness of existing approaches to misspecifications in these assumptions and develop principled approaches to overcome such misspecifications.

We begin by studying misspecifications in the hypothesis class assumed by the learner and propose an agnostic learning setup in which we demonstrate that all existing approaches based on learning from comparisons incur constant regret. We further show that it is necessary for humans to provide more detailed feedback in the form of higher-order comparisons, and we obtain sharp bounds on the regret as a function of the order of comparisons. Next, we focus on misspecifications in human behavioral models and establish, through both theoretical and empirical analyses, that inverse RL methods can be extremely brittle in the worst case. However, under reasonable assumptions, we show that these methods are robust and can recover the underlying reward functions up to a small error term. We then study misspecifications in assumptions about how humans represent objective functions. We first show that a uni-criterion approach to modeling human preferences fails to capture real-world human objectives, and we propose a new multi-criteria, comparison-based framework that overcomes these limitations. In the final part, we shift our focus to hand-specified reward functions in reinforcement learning, an alternative to learning rewards from humans. We empirically study the effects of misspecification in such rewards, showing that over-optimizing these proxy rewards can hurt performance in the long run.



