Description
In this dissertation, we investigate how 3D scenes should be represented so that the representation can be effectively estimated from standard photographs and then used to synthesize images of the same scene from novel, unobserved viewpoints. Recovering photorealistic scene representations from images has been a longstanding goal of computer vision and graphics, and has typically been addressed using representations from standard computer graphics pipelines, such as triangle meshes, which are not particularly amenable to end-to-end optimization for maximizing the fidelity of rendered images. Instead, we advocate for scene representations that are specifically well-suited to differentiable deep learning pipelines. We explore the efficacy of various representations for view synthesis tasks, including synthesizing local views around a single input image, extrapolating views around a pair of nearby input images, and interpolating novel views from a set of unstructured images. We present scene representations that succeed at these tasks and share two key properties: they represent scenes as volumes, and they avoid the poor scaling of regularly-sampled voxel grids by using compressed or parameter-efficient volume representations.
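To make the parameter-efficiency point concrete, the sketch below contrasts the storage cost of a dense, regularly-sampled RGBA voxel grid (which grows cubically with resolution) with a small coordinate-based network that is queried as a continuous volume. This is an illustrative example under assumed settings, not the dissertation's actual models: the layer sizes, grid resolution, and helper names (`voxel_grid_params`, `query_mlp`) are hypothetical.

```python
# Illustrative sketch (hypothetical sizes, not the dissertation's models):
# compare the parameter count of a dense RGBA voxel grid with that of a small
# coordinate-based MLP mapping a 3D point to (RGB, density).
import numpy as np

def voxel_grid_params(resolution, channels=4):
    # A regularly-sampled grid stores one value per channel per voxel,
    # so storage grows cubically with resolution.
    return resolution ** 3 * channels

def mlp_params(layer_sizes):
    # A coordinate-based network stores only weights and biases; its size is
    # independent of the spatial resolution at which it is queried.
    return sum(m * n + n for m, n in zip(layer_sizes[:-1], layer_sizes[1:]))

def init_mlp(layer_sizes, rng):
    # Random weights and zero biases for each layer (He-style scaling).
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def query_mlp(params, xyz):
    # Query the continuous volume at arbitrary 3D points: returns (N, 4) outputs,
    # interpreted here as RGB + density.
    h = xyz
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers
    return h

if __name__ == "__main__":
    layer_sizes = [3, 256, 256, 256, 4]      # hypothetical architecture
    rng = np.random.default_rng(0)
    params = init_mlp(layer_sizes, rng)
    pts = rng.uniform(-1.0, 1.0, size=(1024, 3))
    out = query_mlp(params, pts)             # (1024, 4) colour + density samples
    print("512^3 RGBA voxel grid parameters:", voxel_grid_params(512))   # ~5.4e8
    print("coordinate MLP parameters:       ", mlp_params(layer_sizes))  # ~1.3e5
```

Under these assumed sizes, the continuous network uses several orders of magnitude fewer parameters than the dense grid while still being queryable at arbitrary 3D locations, which is the scaling advantage the paragraph above refers to.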