Language is a powerful representation for capturing knowledge and information about our world. Thanks to its extremely high level of abstraction, it can express discrete concepts such as objects, their attributes, and the relationships between them in a remarkably compact manner. Language is the primary means by which we
communicate, comprehend, and express our thoughts and ideas, and it lies at the very
core of human intelligence. With the advent of powerful generative models, machines have also begun to comprehend and generate natural language with notable fluency and creativity. However, they still lack “grounding”: a direct tie to the visual world.
Vision plays a pivotal role in our comprehension and production of language. When
we describe a scene, understand instructions, or engage in a dialogue, visual context
significantly aids our interpretation and generation of language. This highlights the need to integrate vision into generative modeling.
Chapters 1 and 2 delve into the image-to-text domain, spotlighting the importance of
a multimodal approach for text generation. In Chapter 1, we explore how generating
textual rationales with attention visualizations can enhance model transparency for
visual question answering. In Chapter 2, we build generative models that abandon
traditional left-to-right sequencing in favor of an unsupervised technique to determine
optimal generation orders. Chapters 3 and 4 shift the focus to text-to-image generation.
In Chapter 3, we introduce a training-free framework that combines linguistic cues
with reference images, allowing for controllable image synthesis using denoising
diffusion probabilistic models. Lastly, Chapter 4 emphasizes the importance of
preserving object shapes in text-based image editing, proposing a unique mechanism
that augments text-to-image models to be more faithful to input masks and text
prompts.
Title: Vision and Language Understanding Through Generative Modeling