The rapid progress in visual recognition capabilities over the past several years can be attributed largely to improvements in generic and transferable feature representations, particularly learned representations based on convolutional networks (convnets) trained “end-to-end” to predict visual semantics given raw pixel intensity values. In this thesis, we analyze the structure of these convnet representations and their generality and transferability to other tasks and settings.

We begin in Chapter 2 by examining the hierarchical semantic structure that naturally emerges in convnet representations from large-scale supervised training, even when this structure is unobserved in the training set. Empirically, the resulting representations generalize surprisingly well to classification in related yet distinct settings.

Chapters 3 and 4 showcase the flexibility of convnet-based representations for prediction tasks where the inputs or targets have more complex structure. Chapter 3 focuses on representation transfer to object detection and semantic segmentation, tasks in which objects must be localized within an image as well as labeled. Chapter 4 augments convnets with recurrent structure to handle recognition problems with sequential inputs (e.g., video activity recognition) or outputs (e.g., image captioning). In each of these domains, end-to-end fine-tuning of the representation for the target task provides a substantial additional performance benefit.

Finally, we address the necessity of label supervision for representation learning. In Chapter 5 we propose an unsupervised learning approach based on generative models, demonstrating that some of the transferable semantic structure learned by supervised convnets can be learned from images alone.



