Description
With respect to the first axis, the thesis presents three articles: (1) Data-Efficient Image Recognition using Contrastive Predictive Coding (CPCv2); (2) Contrastive Unsupervised Representations for Reinforcement Learning (CURL); (3) Reinforcement Learning with Augmented Data (RAD). The first two articles explore a form of unsupervised learning called contrastive learning, a technique better suited to raw inputs such as images than the generative pre-training popular for language. The first article presents results for label-efficient image recognition. The second article presents the benefits of contrastive learning for sample-efficient reinforcement learning from pixels. Contrastive learning in practice depends heavily on data augmentations, and the third article presents a detailed investigation and discussion of their role.
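The contrastive objective underlying these articles can be made concrete with a minimal sketch. The following is an illustrative InfoNCE-style loss, not the exact implementation from CPCv2 or CURL: row i of each input holds the embedding of one augmented view of example i, the matching rows form the positive pair, and every other row in the batch serves as a negative.

```python
import numpy as np

def info_nce_loss(q, k, temperature=0.1):
    """Illustrative InfoNCE contrastive loss over a batch of paired embeddings.

    q, k: (N, D) arrays; row i of q and row i of k are embeddings of two
    augmentations of the same input (the positive pair), while all other
    rows act as negatives.
    """
    # L2-normalize so similarities are cosine similarities
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / temperature                  # (N, N) similarity matrix
    # row-wise log-softmax; the diagonal entries are the positive pairs
    m = logits.max(axis=1, keepdims=True)
    log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()
```

Minimizing this loss pulls the two augmented views of each example together while pushing apart views of different examples, which is why the choice of augmentations matters so much in practice.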
As for the second axis, the thesis presents a thorough empirical investigation of the benefits of self-attention and Transformer-like architectures for computer vision through the article: Bottleneck Transformers for Visual Recognition. Self-attention has revolutionized language processing, but computer vision poses a challenge to vanilla Transformers: high-resolution inputs strain the quadratic memory and computational complexity of the primitive. The article presents the empirical effectiveness of a straightforward hybrid composed of convolutions and self-attention, and unifies ResNet- and Transformer-based architecture design for computer vision.