Description
This dissertation presents three contributions to unsupervised learning. First, I describe a signal representation framework called the sparse manifold transform (SMT) that combines key ideas from sparse coding, manifold learning, and slow feature analysis. It turns non-linear transformations in the primary sensory signal space into linear interpolations in a representational embedding space while maintaining approximate invertibility. The SMT is an unsupervised, generative framework that explicitly and simultaneously models the sparse discreteness and low-dimensional manifold structure found in natural scenes; when stacked, it also models hierarchical composition. I provide a theoretical description of the transform and demonstrate properties of the learned representation on both synthetic data and natural videos. The SMT also provides a unifying geometric perspective on the roles of simple and complex cells. I propose a localized SMT in which the neurons in its two layers correspond to simple and complex cells, yielding a new functional explanation for these cell types: simple cells can be viewed as representing a discrete sampling of a smooth manifold in the sensor space, while complex cells can be viewed as representing localized smooth linear functions on that manifold. While each individual complex cell pools from a local region, together they tile the manifold (Sengupta et al. 2018) and build an untangled population representation (DiCarlo & Cox 2007), which tends to preserve the identity of the signal while straightening its transformations (Henaff et al. 2017). In the localized SMT, the complex-cell layer is learned in an unsupervised manner based on a diffusion process. The results demonstrate that simple and complex cells are emergent properties of a neural system that is optimized to learn the manifold structure of dynamic sensory inputs under sparse connectivity constraints.
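To make the embedding step concrete, the sketch below casts the core SMT computation as a generalized eigenvalue problem: given sparse codes of consecutive video frames, it finds a linear embedding whose trajectories are as temporally smooth as possible. This is a minimal sketch under stated assumptions, not the dissertation's implementation; the helper name `smt_embedding`, the regularization constant, and the random stand-in data are all illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def smt_embedding(A, k):
    """Sketch of the SMT embedding step (hypothetical helper).

    A: (n_dict, T) matrix whose columns are sparse codes of T consecutive
       frames, assumed to come from an earlier sparse-coding stage.
    k: embedding dimension.

    Finds P minimizing sum_t ||P a_{t-1} - 2 P a_t + P a_{t+1}||^2 subject
    to a whitening constraint on the embedded codes, i.e. the smoothest
    linear functionals on the dictionary, via a generalized eigenproblem.
    """
    n, T = A.shape
    D = A[:, :-2] - 2.0 * A[:, 1:-1] + A[:, 2:]   # second temporal difference
    Q = D @ D.T / (T - 2)                          # temporal roughness form
    C = A @ A.T / T + 1e-6 * np.eye(n)             # code covariance, regularized
    # Smallest generalized eigenvectors give the smoothest embedding directions.
    _, vecs = eigh(Q, C, subset_by_index=[0, k - 1])
    return vecs.T                                  # (k, n) embedding matrix P

# Illustrative usage with stand-in data (real inputs would be sparse codes
# of natural-video frames).
A = np.abs(np.random.randn(256, 500))
P = smt_embedding(A, k=16)
beta = P @ A   # embedding trajectories; transformations become near-linear
```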
The second contribution concerns the discovery of word factors. Word embedding techniques based on co-occurrence statistics have proved very useful for extracting the semantic and syntactic structure of words as low-dimensional continuous vectors. This work shows that dictionary learning can decompose these word vectors into linear combinations of more elementary word factors. I demonstrate that many of the learned factors have surprisingly strong semantic or syntactic meaning, corresponding to factors previously identified manually by human inspection; dictionary learning thus provides a powerful visualization tool for understanding word embedding representations. Furthermore, I show that the word factors help identify key semantic and syntactic differences in word analogy tasks and improve upon state-of-the-art word embedding techniques on these tasks by a large margin.
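As an illustration of this decomposition, the sketch below applies off-the-shelf sparse dictionary learning (scikit-learn's MiniBatchDictionaryLearning with non-negative codes) to a matrix of word vectors. The matrix sizes and hyperparameters are placeholder assumptions; in practice `W` would hold pretrained embeddings such as GloVe vectors, and the dissertation's exact setup may differ.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# Stand-in for pretrained embeddings; in practice W holds e.g. GloVe
# vectors, one row per word. Sizes here are illustrative only.
rng = np.random.default_rng(0)
W = rng.standard_normal((5000, 300))

dico = MiniBatchDictionaryLearning(n_components=1000, alpha=1.0,
                                   positive_code=True, random_state=0)
codes = dico.fit_transform(W)      # (n_words, n_factors), sparse, non-negative
factors = dico.components_         # (n_factors, 300) elementary word factors

# Each word vector is approximated as codes @ factors; a factor is read out
# by listing the words with the largest loadings on it.
top_words_for_factor_0 = np.argsort(codes[:, 0])[::-1][:10]
```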
The third contribution is a more efficient and effective way to train energy-based models (EBMs) in high-dimensional spaces. Energy-based models assign an unnormalized log-probability to data samples, which supports a variety of applications such as sample synthesis, data denoising, sample restoration, outlier detection, and Bayesian reasoning. However, training EBMs by standard maximum likelihood is extremely slow because it requires sampling from the model distribution. Score matching potentially alleviates this problem; in particular, denoising score matching (Vincent 2011) has been used successfully to train EBMs. Trained on data corrupted with a single fixed noise level, such models learn quickly and yield good results in data denoising (Saremi & Hyvarinen 2019). However, demonstrations of such models in high-quality sample synthesis of high-dimensional data were lacking. Recently, Song & Ermon (2019) showed that a generative model trained by denoising score matching accomplishes excellent sample synthesis when the training data are corrupted with multiple levels of noise. Both analysis and empirical evidence show that training with multiple noise levels is necessary when the data dimension is high. Leveraging this insight, I propose a novel EBM trained with multi-scale denoising score matching. The model exhibits data generation performance comparable to that of state-of-the-art techniques such as GANs and sets a new baseline for EBMs. It also provides density information and performs well on an image inpainting task.
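A minimal sketch of a multi-scale denoising score matching loss is given below, assuming PyTorch and a hypothetical `energy_net` that maps a batch of samples to scalar energies. The noise-scale sampling and the sigma-squared weighting follow the general recipe of Vincent (2011) and Song & Ermon (2019) rather than the dissertation's exact objective.

```python
import torch

def mdsm_loss(energy_net, x, sigmas):
    """Sketch of a multi-scale denoising score matching loss.

    energy_net: hypothetical model mapping a batch to per-sample energies.
    sigmas: 1-D tensor of noise scales.
    """
    # Pick one noise scale per sample and corrupt the data.
    idx = torch.randint(len(sigmas), (x.shape[0],), device=x.device)
    sigma = sigmas[idx].view(-1, *([1] * (x.dim() - 1)))
    noise = torch.randn_like(x)
    x_noisy = (x + sigma * noise).detach().requires_grad_(True)

    # Model score: negative gradient of the energy w.r.t. the input.
    energy = energy_net(x_noisy).sum()
    score = -torch.autograd.grad(energy, x_noisy, create_graph=True)[0]

    # Denoising target: grad of log q_sigma(x_noisy | x), i.e. -noise / sigma.
    target = -noise / sigma
    # sigma^2 weighting keeps the loss comparable across noise scales.
    return ((sigma * (score - target)) ** 2).flatten(1).sum(1).mean()

# Illustrative usage with a toy quadratic energy.
net = lambda z: 0.5 * (z ** 2).flatten(1).sum(1)
x = torch.randn(8, 3, 32, 32)
sigmas = torch.logspace(-2, 0, steps=10)
loss = mdsm_loss(net, x, sigmas)
loss.backward()  # in practice, step an optimizer on energy_net's parameters
```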