[CS] Yihua Cheng Dissertation Defense/Feb 14, 2025
Wed Feb 12 12:01:23 CST 2025
This is an announcement of Yihua Cheng's Dissertation Defense.
===============================================
Candidate: Yihua Cheng
Date: Friday, February 14, 2025
Time: 3 pm CST
Remote Location: https://uchicago.zoom.us/j/3150893650?pwd=RFgxNEE2MUQ0QzlsVXF3Ym94bDQ2Zz09
Location: JCL 298
Title: A Scalable Approach to Distributed Large Language Model Inference
Abstract: As the use of large language models (LLMs) expands rapidly, so do the intensity and scale of the workloads that query them. The requirements for serving LLMs have thus evolved beyond single-instance deployment to large-scale distributed deployment. Because most of today's LLM serving system optimizations focus only on speeding up a single serving instance, key techniques for distributed LLM deployments are still missing.
The key contribution of this dissertation is the design and implementation of an efficient system for deploying distributed LLM serving engines. Our thesis is that by decoupling the inference states (KV caches) from the LLM serving engine, the performance of distributed LLM inference can be substantially improved. Unlike prior work, which treats the LLM serving engine as a black box and builds global orchestrators over the serving engines, our approach uses a separate module to transfer, store, and share the KV caches across different serving engines.
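
As a rough illustration of such a decoupled KV-cache module, the Python sketch below shows a store that serving engines offload caches to and fetch caches from; the names (KVCacheStore, put, get) are hypothetical and are not the API of the open-sourced system.

from dataclasses import dataclass, field
from typing import Dict, Optional

import torch


@dataclass
class KVCacheStore:
    """Shared store that serving engines use to offload and fetch KV caches."""
    _entries: Dict[str, torch.Tensor] = field(default_factory=dict)

    def put(self, prefix_hash: str, kv: torch.Tensor) -> None:
        # An engine offloads the KV cache of a token prefix after prefill,
        # e.g. spilling it from GPU memory to CPU memory or remote storage.
        self._entries[prefix_hash] = kv.to("cpu")

    def get(self, prefix_hash: str) -> Optional[torch.Tensor]:
        # Any engine in the cluster can load the cache back instead of recomputing it.
        return self._entries.get(prefix_hash)
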
To prove this thesis, this dissertation provides a suite of techniques that address the following fundamental challenges. First, we need an efficient way to offload KV caches from serving engines and load them back. Second, we need a scalable solution that is compatible with heterogeneous hardware and infrastructure stacks.
Our key insight is that many KV caches can be reused across serving engines in a large-scale distributed setup. With a reused KV cache, the LLM can skip the computationally intensive prefill and start generating outputs immediately. We have developed algorithms to properly transmit, store, and reuse KV caches in the distributed serving system. We have shown that our solution substantially reduces computational cost on real-world workloads and provides stronger production-readiness guarantees, including faster auto-scaling and higher service availability. We have also open-sourced our solution at https://github.com/vllm-project/production-stack.
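
As a minimal sketch of how a reused KV cache lets an engine skip prefill, consider the following Python fragment; hash_prefix, store, and the engine.prefill/engine.decode calls are hypothetical placeholders used only to illustrate the cache-hit path, not the actual system.

import hashlib

def hash_prefix(token_ids: list[int]) -> str:
    # Key the shared cache by a hash of the token prefix (e.g., a common system prompt).
    return hashlib.sha256(repr(token_ids).encode()).hexdigest()

def serve(engine, store, token_ids: list[int]) -> str:
    key = hash_prefix(token_ids)
    kv = store.get(key)                 # look up the KV cache shared across engines
    if kv is None:
        kv = engine.prefill(token_ids)  # cache miss: pay the full prefill cost once
        store.put(key, kv)              # offload so other engines can reuse it later
    # On a cache hit, decoding starts immediately from the stored KV cache.
    return engine.decode(token_ids, kv)
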
Advisor: Junchen Jiang
Committee Members: Junchen Jiang, Kexin Pei, Hui Zhang