[CS] Re: Zain Sarwar MS Presentation - Jun 26, 2025
Lucas Awad via cs
cs at mailman.cs.uchicago.edu
Thu Jun 12 10:06:54 CDT 2025
Please remove me from this mailing list. I've tried without success to unsubscribe multiple times.
Thank you,
Lucas Awad<https://www.linkedin.com/in/lucasawad/>
University of Chicago '28
B.A. Economics, Computer Science
(773) 322 2429 | lawad at uchicago.edu
________________________________
From: cs <cs-bounces+lawad=cs.uchicago.edu at mailman.cs.uchicago.edu> on behalf of via cs <cs at mailman.cs.uchicago.edu>
Sent: Thursday, June 12, 2025 7:00 PM
To: cs at cs.uchicago.edu <cs at cs.uchicago.edu>; colloquium at cs.uchicago.edu <colloquium at cs.uchicago.edu>
Subject: [CS] Zain Sarwar MS Presentation - Jun 26, 2025
This is an announcement of Zain Sarwar's MS Presentation
===============================================
Candidate: Zain Sarwar
Date: Thursday, June 26, 2025
Time: 9:00 AM CDT
Remote Location: https://uchicago.zoom.us/j/94869095059?pwd=DV4gGttLyWg6qZANktVJRZXV60BL1N.1
Title: Continual Pretraining, Dense Backpropagation, and Hierarchical Routing in Mixture of Experts
Abstract: Mixture-of-Experts (MoE) models have emerged as a powerful scaling strategy for large language models (LLMs), activating only a sparse subset of parameters per token to improve compute efficiency and performance. However, their adoption introduces a distinct set of challenges across training stability, scaling dynamics, and continual learning. In this thesis, we present three contributions aimed at advancing the robustness and scalability of MoE models.
First, we explore the continual pretraining of MoE transformers, investigating whether sparse routing mechanisms hinder adaptation to new data. Our findings demonstrate that, with appropriate strategies, MoEs maintain their sample efficiency and expert balance across distribution shifts, offering a practical alternative to full model retraining. Second, we address a fundamental limitation of sparse learning: the router is exposed to only partial gradient signals. We introduce DefaultMoE, a method that approximates dense gradients via expert-wise exponential moving averages (EMAs), yielding improved pretraining efficiency without sacrificing sparsity. Third, we propose StructMoE, a hierarchical architecture that augments each expert with multiple low-rank submodules selected via a secondary router. This design introduces dynamic intra-expert routing, enabling structured parameter growth and improved expressivity. Empirical results demonstrate superior performance over standard MoEs.
Advisor: Michael Maire
Committee Members: Michael Maire, Risi Kondor, and Chenhao Tan
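For readers less familiar with the mechanisms in the abstract above, the following PyTorch sketch illustrates two of them together: top-k sparse routing (each token activates only k of E experts) and the dense-gradient idea the abstract attributes to DefaultMoE, where a running (EMA) estimate of each expert's output stands in for the experts a token did not visit, so every gate weight still receives a gradient. It is a minimal sketch under those assumptions, not code from the thesis; all names, sizes, and hyperparameters are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoEWithEMADefaults(nn.Module):
    """Top-k sparse MoE layer; unselected experts are replaced by a detached EMA
    of their past outputs so the router's full softmax receives gradient.
    Illustrative only -- not the DefaultMoE implementation from the thesis."""
    def __init__(self, d_model=64, num_experts=8, k=2, ema_decay=0.99):
        super().__init__()
        self.k, self.ema_decay = k, ema_decay
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))
        # One EMA "default output" per expert (mean over tokens it has processed).
        self.register_buffer("ema_out", torch.zeros(num_experts, d_model))

    def forward(self, x):                                   # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)           # dense gate over all experts
        topk = gates.topk(self.k, dim=-1).indices           # sparse choice: k experts per token
        selected = torch.zeros_like(gates, dtype=torch.bool).scatter_(1, topk, True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = selected[:, e]
            if mask.any():
                y = expert(x[mask])                         # real output where selected
                out[mask] += gates[mask, e, None] * y
                with torch.no_grad():                       # update this expert's EMA
                    self.ema_out[e].lerp_(y.mean(0), 1 - self.ema_decay)
            # Where this expert was NOT selected, substitute its (detached) EMA output,
            # so the gate weight for every expert contributes to the router's gradient.
            out[~mask] += gates[~mask, e, None] * self.ema_out[e]
        return out

model = TopKMoEWithEMADefaults()
model(torch.randn(16, 64)).sum().backward()
print(model.router.weight.grad.shape)   # torch.Size([8, 64]); gradient flows through gates on both real and EMA outputs

In a production MoE the per-expert Python loop would be replaced by batched dispatch, and load-balancing auxiliary losses would typically be added; those details are omitted here.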
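The abstract's description of StructMoE, i.e. experts augmented with low-rank submodules chosen by a secondary router, can be pictured roughly as below. The actual StructMoE design (how many submodules, how they are combined, whether selection is hard or soft) is not given in the announcement, so this sketch simply adds one gate-weighted low-rank adapter per token on top of a base expert; class names and shapes are invented.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankAdapter(nn.Module):
    """A small low-rank submodule: d_model -> rank -> d_model."""
    def __init__(self, d_model=64, rank=4):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)

    def forward(self, x):
        return self.up(self.down(x))

class HierarchicalExpert(nn.Module):
    """One MoE expert with a secondary (intra-expert) router that picks which
    low-rank submodule to add to the base output. Illustrative only."""
    def __init__(self, d_model=64, num_sub=4):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)        # the expert's main transformation
        self.sub_router = nn.Linear(d_model, num_sub)  # secondary router over submodules
        self.subs = nn.ModuleList(LowRankAdapter(d_model) for _ in range(num_sub))

    def forward(self, x):                              # x: (tokens, d_model)
        gates = F.softmax(self.sub_router(x), dim=-1)
        best = gates.argmax(dim=-1)                    # hard pick of one submodule per token
        delta = torch.zeros_like(x)
        for s, sub in enumerate(self.subs):
            mask = best == s
            if mask.any():                             # gate-weighted low-rank contribution
                delta[mask] += gates[mask, s, None] * sub(x[mask])
        return self.base(x) + delta

print(HierarchicalExpert()(torch.randn(16, 64)).shape)   # torch.Size([16, 64])

Wrapping several such experts in an outer top-k router, as in the first sketch, would give the two-level routing the abstract describes; parameter growth then comes from adding low-rank submodules rather than whole experts.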
When unsubscribing, use your cnetid at cs.uchicago.edu address if your cnetid at uchicago.edu does not work.
cs mailing list - cs at mailman.cs.uchicago.edu
Edit Options and/or Unsubscribe: https://mailman.cs.uchicago.edu/mailman/listinfo/cs
More information here: https://howto.cs.uchicago.edu/techstaff:mailinglist