Humans perceive the world through their eyes, where the images formed on the retina are two-dimensional projections of the underlying three-dimensional world. Akin to human vision, the goal of computer vision is to extract information about the 3D world from 2D images. A fundamental problem in computer vision is to recover the 3D structure underlying such 2D images. Even though this problem is mathematically ill-posed, the ambiguity can be resolved either by using multiple 2D views or by using priors about how the world is structured.

In this thesis, I present my work on high-fidelity 3D mesh reconstruction of humans and objects from 2D images. I discuss the more classical setting of optimizing shape and texture from multiple input images, as well as how we can learn priors that enable mesh reconstruction even from a single image. Specifically, I first present work on multi-view 3D reconstruction, where we reconstruct meshes of an object given only a few images with noisy camera poses. Then, I continue with 3D reconstruction from single images, enabled by learning category-specific shape priors from natural image datasets. Finally, I focus on learning single-view 3D human reconstruction using large models and large datasets. Such robust 3D reconstruction of humans enables downstream applications like 3D tracking and action recognition.



