Description
For model optimization, we introduce quantization techniques that reduce inference-time compute and memory requirements. I-BERT reduces compute by leveraging integer-only quantization, achieving up to a 3.5X latency speedup and enabling deployment of Transformer architectures on integer-only hardware. SqueezeLLM employs extremely low-bit weight quantization, which reduces memory requirements without sacrificing accuracy during LLM inference. For enhanced inference methods, we present the Big Little Decoder, a speculative decoding framework that accelerates autoregressive LLM inference by up to 2X through collaboration between a small and a large model. Regarding model architectures, we propose an efficient design for speech recognition based on a Temporal U-Net structure, which improves inference efficiency by shortening input sequence lengths. Finally, at the application level, we introduce LLMCompiler, a framework for efficiently orchestrating multiple function calls in LLM-based applications, which reduces execution latency and cost while improving robustness by decomposing complex user inputs into smaller, simpler tasks. Collectively, these contributions provide a full-stack strategy for optimizing AI model inference, from low-level systems to high-level applications, enabling the efficient deployment and serving of state-of-the-art AI solutions.
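To make the quantization idea concrete, the sketch below illustrates the general principle behind integer-only inference for a linear layer: activations and weights are mapped to int8, the matrix multiplication is accumulated in int32, and a single floating-point rescale recovers the result. This is a minimal, self-contained illustration, not I-BERT's actual kernels or quantization scheme; the function name quantize_int8 and the toy shapes are assumptions made for the example.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: x ~= scale * q (illustrative only)."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Toy linear layer y = x @ W in full precision.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)   # activations
W = rng.standard_normal((64, 32)).astype(np.float32)  # weights

q_x, s_x = quantize_int8(x)
q_W, s_W = quantize_int8(W)

# Integer matmul accumulated in int32, followed by one rescale back to float.
y_int = q_x.astype(np.int32) @ q_W.astype(np.int32)
y_quant = y_int * (s_x * s_W)

# The quantized output should closely track the full-precision result.
print("max abs error:", np.max(np.abs(y_quant - x @ W)))
```

In practice, the benefit comes from executing the int8/int32 arithmetic on integer-only hardware units; the example above only demonstrates that the rescaled integer computation approximates the floating-point layer.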