[Theory] [Talks at TTIC] 1/6 TTIC Colloquium: Kaifeng Lyu, UC Berkeley

Brandie Jones via Theory theory at mailman.cs.uchicago.edu
Fri Jan 3 12:00:00 CST 2025


*When:*        Monday, January 6th at *11:30am CT*


*Where:*       Talk will be given *live, in-person* at

                       TTIC, 6045 S. Kenwood Avenue

                       5th Floor, Room 530


*Virtually:*  via Panopto (livestream:
<https://uchicago.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=f5426f91-dc51-405a-a960-b248011c2bb1>)


*Who:*          Kaifeng Lyu, UC Berkeley

*Title:*          Scaling Hyperparameters in Training Large Models with
Theoretical Insights
*Abstract:*  Training large models is both resource-intensive and
time-consuming, making it crucial to understand the quantitative
relationship between model performance and hyperparameters. In this talk,
I will present our recent works that leverage theoretical insights to
tackle this issue from several angles. First, distributed training
requires large batch sizes to fully exploit data parallelism, but how
should the learning rate be tuned as the batch size changes? Our work
studies SDE approximations for large-batch RMSprop and Adam and derives
the Square Root Scaling Rule (SRSR): scale the learning rate in proportion
to the square root of the batch size. Second, training large models with
many workers can introduce significant communication overhead. Local
gradient methods, such as Local SGD, address this by allowing workers to
compute locally for H steps before synchronizing with others. I will
discuss a curious case in ImageNet-scale supervised learning: if H scales
up quadratically as the learning rate decays, Local SGD can reach higher
test accuracy than standard SGD run for the same number of steps. This
quadratic scaling is backed by a sharpness-based implicit bias analysis of
Local SGD. Finally, for LLM pretraining, we explore how to optimize
learning rate schedules for any given training horizon. Our recent paper
proposes an empirical law that describes how the pretraining loss evolves
under different learning rate schedules. By minimizing the predicted final
pretraining loss over feasible schedules, we identify a schedule that
outperforms the widely used cosine schedule.
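
As a quick illustration of the scaling rule above (an editorial sketch,
not code from the talk): below, Adam's learning rate is scaled by the
square root of the batch-size ratio when moving from a well-tuned
small-batch setup to a larger batch. The base batch size, base learning
rate, and placeholder model are assumptions chosen only for illustration.

    import math
    import torch

    # Illustrative base configuration, assumed to be well-tuned at a small batch size.
    BASE_BATCH_SIZE = 256
    BASE_LR = 3e-4

    def srsr_lr(new_batch_size: int,
                base_batch_size: int = BASE_BATCH_SIZE,
                base_lr: float = BASE_LR) -> float:
        """Square Root Scaling Rule: when the batch size is multiplied by k,
        multiply the Adam/RMSprop learning rate by sqrt(k)."""
        k = new_batch_size / base_batch_size
        return base_lr * math.sqrt(k)

    model = torch.nn.Linear(784, 10)   # placeholder model
    large_batch = 4096                 # larger batch to exploit data parallelism
    optimizer = torch.optim.Adam(model.parameters(), lr=srsr_lr(large_batch))
    print(f"scaled learning rate: {srsr_lr(large_batch):.1e}")  # sqrt(16) * 3e-4 = 1.2e-03

For comparison, the classic linear scaling rule commonly used for plain
SGD would scale the learning rate by the batch-size ratio itself rather
than by its square root.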
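
In the same spirit, here is a minimal sketch of one communication round of
Local SGD in which the number of local steps H grows quadratically as the
learning rate decays. The base_lr and base_H values, the in-process list
of worker replicas, and the simple parameter averaging are illustrative
assumptions; a real distributed run would shard data across processes and
average parameters with an all-reduce.

    import itertools
    import torch
    import torch.nn.functional as F

    def local_sgd_round(worker_models, worker_loaders, lr, base_lr=0.1, base_H=4):
        """One communication round of Local SGD (illustrative sketch).

        Each worker runs H local SGD steps on its own data, then all workers
        average their parameters. H is set to base_H * (base_lr / lr)^2, i.e.
        it grows quadratically as the learning rate decays.
        """
        H = max(1, int(base_H * (base_lr / lr) ** 2))
        for model, loader in zip(worker_models, worker_loaders):
            opt = torch.optim.SGD(model.parameters(), lr=lr)
            batches = itertools.cycle(loader)
            for _ in range(H):
                x, y = next(batches)
                opt.zero_grad()
                F.cross_entropy(model(x), y).backward()
                opt.step()
        # Synchronize: replace every worker's parameters with the average.
        with torch.no_grad():
            for params in zip(*(m.parameters() for m in worker_models)):
                avg = torch.stack(params).mean(dim=0)
                for p in params:
                    p.copy_(avg)

Here each element of worker_models is a replica of the same architecture;
keeping them in one process is purely for readability of the sketch.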
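
Finally, a schematic of the "minimize the predicted final loss over
feasible schedules" idea from the last part of the abstract. The empirical
law itself is the contribution of the paper and is not reproduced here:
predict_final_loss below is a hypothetical stand-in for a fitted
predictor, and the warmup-plus-linear-decay family is only one possible
way to parameterize candidate schedules.

    from typing import Callable, Sequence

    def make_schedule(total_steps: int, peak_lr: float, warmup_frac: float,
                      final_lr_frac: float) -> list[float]:
        """A simple warmup-then-linear-decay schedule family (illustrative)."""
        warmup_steps = max(1, int(warmup_frac * total_steps))
        schedule = []
        for t in range(total_steps):
            if t < warmup_steps:
                schedule.append(peak_lr * (t + 1) / warmup_steps)
            else:
                progress = (t - warmup_steps) / max(1, total_steps - warmup_steps)
                schedule.append(peak_lr * (1 - (1 - final_lr_frac) * progress))
        return schedule

    def best_schedule(total_steps: int, peak_lr: float,
                      predict_final_loss: Callable[[Sequence[float]], float]):
        """Grid-search the schedule family, keeping the candidate whose
        predicted final pretraining loss is lowest."""
        candidates = [make_schedule(total_steps, peak_lr, w, f)
                      for w in (0.01, 0.02, 0.05)
                      for f in (0.0, 0.1, 0.3)]
        return min(candidates, key=predict_final_loss)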

*Short Bio*: Kaifeng Lyu is a Postdoctoral Research Fellow at the Simons
Institute for the Theory of Computing at UC Berkeley. He completed his
Ph.D. in Computer Science at Princeton University in 2024, where he was
advised by Sanjeev Arora. His research explores the theoretical and
scientific foundations of deep learning and large language models. In Fall
2025, he will join the Institute for Interdisciplinary Information Sciences
(IIIS) at Tsinghua University as a Tenure-Track Assistant Professor.

*Hosts:*  Zhiyuan Li <zhiyuanli at ttic.edu> and Nati Srebro <nati at ttic.edu>


-- 
*Brandie Jones*
*Executive Administrative Assistant*
Toyota Technological Institute
6045 S. Kenwood Avenue
Chicago, IL  60637
www.ttic.edu