Description
For model optimization, we introduce quantization techniques that reduce inference-time compute and memory requirements. I-BERT reduces compute by leveraging integer-only quantization, achieving up to a 3.5X latency speedup and enabling deployment of Transformer architectures on integer-only hardware. SqueezeLLM employs extremely low-bit weight quantization, which reduces memory requirements without sacrificing accuracy during LLM inference. For enhanced inference methods, we present the Big Little Decoder, a speculative decoding framework that accelerates autoregressive LLM inference by up to 2X through collaboration between a small and a large model. Regarding model architectures, we propose an efficient design for speech recognition based on a Temporal U-Net structure, which improves inference efficiency by shortening input sequence lengths. Finally, at the application level, we introduce LLMCompiler, a framework for efficiently orchestrating multiple function calls in LLM-based applications, which reduces execution latency and cost while improving robustness by decomposing complex user inputs into smaller, simpler tasks. Collectively, these contributions provide a full-stack strategy for optimizing AI model inference, from low-level systems to high-level applications, enabling the efficient deployment and serving of state-of-the-art AI solutions.
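To make the quantization idea concrete, the sketch below illustrates the general principle behind integer-only inference for a linear layer: activations and weights are mapped to int8, the matrix multiplication is accumulated in int32, and a single floating-point rescale recovers the result. This is a minimal, self-contained illustration, not I-BERT's actual kernels or quantization scheme; the function name quantize_int8 and the toy shapes are assumptions made for the example.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: x ~= scale * q (illustrative only)."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Toy linear layer y = x @ W in full precision.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)   # activations
W = rng.standard_normal((64, 32)).astype(np.float32)  # weights

q_x, s_x = quantize_int8(x)
q_W, s_W = quantize_int8(W)

# Integer matmul accumulated in int32, followed by one rescale back to float.
y_int = q_x.astype(np.int32) @ q_W.astype(np.int32)
y_quant = y_int * (s_x * s_W)

# The quantized output should closely track the full-precision result.
print("max abs error:", np.max(np.abs(y_quant - x @ W)))
```

In practice, the benefit comes from executing the int8/int32 arithmetic on integer-only hardware units; the example above only demonstrates that the rescaled integer computation approximates the floating-point layer.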