Scene representation, the process of converting visual data into efficient, accurate features, is essential for the development of general robot intelligence. This task is inspired by human experience: humans generally take in a novel scene by identifying its important features and objects before planning their actions around them. Recently, the Generative Query Network (GQN) was introduced, which takes in observations of a scene from random viewpoints, constructs an internal representation, and uses that representation to predict the image from an arbitrary query viewpoint. GQNs have shown that accurate representations of varied scenes can be learned without human labels or prior domain knowledge, but one limiting factor remains: the input viewpoints are chosen at random. By training an agent to choose where to capture the input observations, we can supply the GQN with more informative, less redundant data. We show that an agent trained with reinforcement learning (RL) can select input viewpoints that are far more informative than random ones, yielding better representations and thus more complete reconstructions, which may in turn improve performance on tasks in complex environments.
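The pipeline described above can be sketched at a toy level: a GQN-style model encodes each (image, viewpoint) observation and aggregates the encodings into an order-invariant scene representation, while an RL agent selects which viewpoints to observe. The sketch below, using NumPy, is illustrative only: `encode_observation` stands in for GQN's convolutional encoder, and `select_viewpoints` is a generic epsilon-greedy selector whose learned value function is assumed to exist; none of these names come from the original work.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_observation(image, viewpoint):
    # Toy stand-in for GQN's per-view encoder: flatten the image and
    # viewpoint into one feature vector (the real encoder is a conv net).
    return np.concatenate([image.ravel(), viewpoint])

def scene_representation(observations):
    # GQN aggregates per-view encodings by elementwise summation, so the
    # representation is invariant to the order of the input viewpoints.
    return sum(encode_observation(img, vp) for img, vp in observations)

def select_viewpoints(candidate_vps, value_fn, k=3, epsilon=0.1):
    # Hypothetical epsilon-greedy viewpoint selection: the agent usually
    # picks the viewpoints its learned value function scores highest, but
    # explores with probability epsilon (training value_fn is omitted).
    chosen, pool = [], list(range(len(candidate_vps)))
    for _ in range(k):
        if rng.random() < epsilon:
            idx = int(rng.choice(pool))
        else:
            idx = max(pool, key=lambda i: value_fn(candidate_vps[i]))
        pool.remove(idx)
        chosen.append(candidate_vps[idx])
    return chosen
```

The summation aggregator is what lets the representation accept a variable number of input views; the RL component only changes *which* views are summed, not the architecture itself.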



