In recent years, computer vision has made great leaps towards 2D understanding of sparse visual snapshots of the world. This is insufficient for robots that need to exist and act in the 3D world around them based on a continuous stream of multi-modal inputs. In this work, we present efforts towards bridging this gap between computer vision and robotics. We show how thinking about computer vision and robotics together exposes limitations of current computer vision tasks and techniques, and motivates the joint study of perception and action. We present initial efforts in this direction and investigate a) how we can move from 2D understanding of images to 3D understanding of the underlying scene, b) how recent advances in representation learning for images can be extended to obtain representations for the varied sensing modalities useful in robotics, and c) how thinking about vision and action together can lead to more effective solutions for the classical problem of visual navigation.