Objects exhibit organizational structure in their real-world setting (Biederman\etal, 1982). Contextual reasoning is part of human visual understanding and has been modeled by various efforts in computer vision (Torralba, 2001). Recently, object recognition has reached a new peak with the help of deep learning. State-of-the-art object recognition systems use convolutional neural networks (CNNs) to classify regions of interest in an image. The visual cues extracted for each region are limited to the content of that region and ignore contextual information from the scene. The question remains: how can we enhance convolutional neural networks with contextual reasoning to improve recognition?
The work presented in this manuscript shows how contextual cues conditioned on the scene and the object can improve the ability of CNNs to recognize difficult, highly contextual objects in images. Turning to the most interesting object of all, people, contextual reasoning is key to the fine-grained tasks of action and attribute recognition. Here, we demonstrate the importance of extracting cues in an instance-specific and category-specific manner tied to the task in question. Finally, we study motion, which captures changes in shape and appearance over time and provides a way to extract dynamic contextual cues. We show that coupling motion with the complementary signal of static visual appearance leads to a very effective representation for action recognition in videos.