Humans possess a remarkable ability to extract general object representations from a single image, capturing not only shape and texture but also 3D form. In contrast, 3D reasoning in many computer vision systems remains limited. This thesis presents three efforts aimed at bridging this gap in 3D object perception. First, we introduce a new dataset focused on real-world, object-centered 3D understanding. The dataset provides a diverse set of objects corresponding to real household items, with varying geometries and physically based rendering materials, along with additional annotations describing each object, making it a valuable resource for training and evaluating computer vision models. Next, we design a method for automatically inferring the articulation of 3D objects. The method enables interaction with 3D objects and can be used to generate more realistic and dynamic scenes; by understanding how different parts of an object move and interact, computer vision systems can better model and reason about complex 3D scenes in simulation. Finally, we investigate the effectiveness of contrastive learning with 3D data augmentation, which generates multiple views of each object, a departure from the typical practice of training on single-view images. We show that generating multiple views of objects helps computer vision systems learn better representations and improves their overall object understanding in terms of classification and shape perception. Together, these contributions represent steps toward bridging the gap between human and machine 3D object perception, ultimately enabling vision systems to understand 3D objects from single images in ways that are more aligned with human perception.



