I propose a series of models based on the idea of factoring image sequences into two types of latent variables: a stable percept, and a low-dimensional, time-varying representation of its transformation. Such a two-component model is a general mechanism for teasing apart the causes that conspire to produce a time-varying image. First, I show that when both components are represented by linear expansions, the resulting bilinear model can achieve some degree of image stabilization by using the transformation component to explain the translational motions that occur within a small window of a movie. However, the recovered latent factors exhibit dependencies that motivate a richer second model of appearance dynamics based on the exponential map. In addition to the translations captured by the linear model, this richer model learns transformations that can compensate for rotations, expansions, and complex distortions in the data. Lastly, I propose a hierarchical model that describes images in terms of groupings of lower-level features; learning the parameters of this hierarchy is enabled by an inference procedure that maintains uncertainty in the posterior distributions over the latent variables.
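To make the two generative forms concrete, the following is a minimal sketch, not the actual models learned in this work: the bilinear model combines appearance and transformation coefficients through a learned third-order tensor, while the exponential-map model applies a Lie-group operator, built from a weighted sum of generator matrices, to the stable percept. All array names, dimensions, and the random "learned" parameters here are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential

rng = np.random.default_rng(0)
D, J, K = 16, 8, 3  # pixels, appearance dims, transformation dims (hypothetical sizes)

# Bilinear form: each pixel is a sum over products of the two coefficient sets,
#   image[d] = sum_{j,k} W[d, j, k] * x[j] * y[k]
W = rng.standard_normal((D, J, K))  # stand-in for learned weights
x = rng.standard_normal(J)          # appearance (stable percept) coefficients
y = rng.standard_normal(K)          # transformation coefficients
image_bilinear = np.einsum('djk,j,k->d', W, x, y)

# Exponential-map form: a transformation operator T(s) = expm(sum_k s[k] * A[k])
# acts on the stable percept; the generators A[k] would be learned from data.
A = 0.1 * rng.standard_normal((K, D, D))  # stand-in for learned generators
s = rng.standard_normal(K)                # low-dimensional transformation coords
T = expm(np.einsum('k,kde->de', s, A))    # group element for this frame
percept = rng.standard_normal(D)          # stable percept
image_transformed = T @ percept
```

A convenient property of the exponential-map form is that setting the transformation coordinates to zero yields the identity operator, so the rendered frame is exactly the stable percept; nonzero coordinates then smoothly parameterize rotations, dilations, and other distortions through the choice of generators.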
The contribution of this work is a demonstration of an adaptive mechanism that can automatically learn transformations within a structured model, enabling sources of variability to be factored out by inverting that model. This is an important step, because such variability is the principal source of difficulty for artificial object recognition systems, and visual invariance is closely related to generalization, an ability commonly equated with intelligence. Thus, to the extent that we can build seeing machines that automatically compensate for category-level variability, we will have achieved some part of the goal of artificial intelligence.