The world is messy and imperfect, unstructured and complex, and nonetheless we must accomplish the basic behaviors necessary for survival. It is for this purpose, ecologically relevant behavior, that vision evolved 500-600 million years ago.

This thesis is about how to learn representations of the visual world that are useful for the types of behaviors we might want an embodied AI system to perform. In the first part of this thesis, we systematically study how bottlenecking visual inputs through different pretrained representations affects the ability of a robot to learn different atomic navigation skills (Chapter 2) and manipulation skills (Chapter 3) through trial and error. The main finding is that the appropriate pretrained representation greatly improves both the sample efficiency of skill acquisition and the generalization of the learned skill. In the second part of the thesis, we use these lessons to improve the accuracy of the representations in a larger variety of contexts (indoors, outdoors, tabletop settings, and so on). In Chapter 4 we do this by adding cross-prediction consistency objectives. In Chapter 5 we do this by leveraging vast amounts of 3D data available on the internet and from a robot’s prior experience.

The methods are primarily developed for the purpose of vision and action, but many of the ideas are general and could work for other sensory modalities and behaviors.



