Description
We live in a world made up of different objects, people, and environments interacting with each other: people who work, write, eat, and drink; vehicles that move on land, water, or in the air; rooms that are furnished with chairs, tables, and carpets. This vast amount of information can be easily collected from the recorded videos and photographs shared online. However, it still remains a challenge to teach an intelligent machinery agent to reliably analyze and understand this extensive collection of data. Generative models are one of the most compelling methods towards modeling visual realism from the large corpus of available images, which operate by teaching a machine to create new contents. These models are not only beneficial in understanding the visual world, but more deeply in visual synthesis and content creation. They can assist human users in manipulating and editing an existing visual content. In the last few years, Generative Adversarial Networks (GANs) as an important type of generative models have made remarkable enhancements in learning complex data manifolds by generating data points from scratch. The GAN training procedure pits two neural networks against each other, a generator and a discriminator. The discriminator is trained to distinguish between the real samples and the generated ones. The generator is trained to fool the discriminator into thinking its outputs are real. The network learns the real-world distribution while generating high-quality images, translating a text phrase into an image, or transforming images from one domain to another. This dissertation investigates algorithms to improve the performance of such models in creating new visual content specifically in structural and compositional domains in a wide range from hand-designed fonts to natural complex scenes. In Chapter 2, we consider text as a visual element and propose tools to synthesize new glyphs in a font domain and transfer the style of the seen characters to the generated ones. From Chapter 3, we focus on the domain of natural images and propose GAN models capable of synthesizing complex scene images with lots of variations in the number of objects, their locations, shapes, etc. In Chapter 4, we explore the role of compositionality in the GAN frameworks and propose a new method to learn a function that maps images of different objects sampled from their marginal distributions into a combined sample that captures the joint distribution of object pairs. Despite all the improvements in training GANs, it still remains a challenge to fully optimize the GAN generator in a two-player adversarial game, resulting in samples that do not always follow the target distribution. In Chapter 5, instead of trying to improve the training procedure, we propose an approach to improve the quality of the trained generator by post-processing its generated samples using information from the optimized discriminator.