[CS] Re: Zain Sarwar MS Presentation, Jun 26, 2025

Lucas Awad via cs cs at mailman.cs.uchicago.edu
Thu Jun 12 10:06:54 CDT 2025


Please remove me from this mailing list. I've tried without success to unsubscribe multiple times.

Thank you,

Lucas Awad<https://www.linkedin.com/in/lucasawad/>
University of Chicago '28
B.A. Economics, Computer Science
(773) 322 2429 | lawad at uchicago.edu

________________________________
From: cs <cs-bounces+lawad=cs.uchicago.edu at mailman.cs.uchicago.edu> on behalf of via cs <cs at mailman.cs.uchicago.edu>
Sent: Thursday, June 12, 2025 7:00 PM
To: cs at cs.uchicago.edu <cs at cs.uchicago.edu>; colloquium at cs.uchicago.edu <colloquium at cs.uchicago.edu>
Subject: [CS] Zain Sarwar MS Presentation, Jun 26, 2025

This is an announcement of Zain Sarwar's MS Presentation
===============================================
Candidate: Zain Sarwar

Date: Thursday, June 26, 2025

Time: 9:00 am Central Time (CDT)

Remote Location: https://uchicago.zoom.us/j/94869095059?pwd=DV4gGttLyWg6qZANktVJRZXV60BL1N.1

Title: Continual Pretraining, Dense Backpropagation, and Hierarchical Routing in Mixture of Experts

Abstract: Mixture-of-Experts (MoE) models have emerged as a powerful scaling strategy for large language models (LLMs), enabling sparse activation of parameters to achieve improved compute efficiency and performance. However, their adoption introduces a unique set of challenges across training stability, scaling dynamics, and continual learning. In this thesis, we present three contributions aimed at advancing the robustness and scalability of MoE models.
First, we explore the continual pretraining of MoE transformers, investigating whether sparse routing mechanisms hinder adaptation to new data. Our findings demonstrate that, with appropriate strategies, MoEs maintain their sample efficiency and expert balance across distribution shifts, offering practical alternatives to full model retraining.
Second, we address a fundamental limitation of sparse learning: the router's exposure to only partial gradient signals. We introduce DefaultMoE, a method that approximates dense gradients via expert-wise exponential moving averages (EMAs), yielding improved pretraining efficiency without sacrificing sparsity.
Third, we propose StructMoE, a hierarchical architecture that augments each expert with multiple low-rank submodules selected via a secondary router. This design introduces dynamic intra-expert routing, enabling structured parameter growth and improved expressivity. Empirical results demonstrate superior performance over standard MoEs.
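
For readers unfamiliar with the mechanics sketched in the abstract, the toy example below illustrates the general idea of sparse top-k expert routing combined with an expert-wise exponential moving average that stands in for the outputs of experts the router did not select. It is a hypothetical, minimal PyTorch sketch: the class name TinyMoE, the hyperparameters, and the EMA update rule are illustrative assumptions, not the candidate's DefaultMoE or StructMoE implementations.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy top-k mixture-of-experts layer with an expert-wise output EMA.
    Hypothetical illustration only; not the candidate's implementation."""
    def __init__(self, d_model=64, n_experts=4, top_k=1, ema_decay=0.99):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )
        self.top_k = top_k
        self.ema_decay = ema_decay
        # Running mean of each expert's output; used as a cheap stand-in for
        # experts the router did not select on a given token.
        self.register_buffer("expert_ema", torch.zeros(n_experts, d_model))

    def forward(self, x):                                 # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)         # (tokens, n_experts)
        topk_p, topk_i = probs.topk(self.top_k, dim=-1)   # (tokens, top_k)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            routed = (topk_i == e).any(dim=-1)            # tokens sent to expert e
            if routed.any():
                y = expert(x[routed])
                gate = topk_p[routed][topk_i[routed] == e].unsqueeze(-1)
                out[routed] = out[routed] + gate * y
                with torch.no_grad():                     # track real expert outputs
                    self.expert_ema[e].lerp_(y.mean(dim=0), 1 - self.ema_decay)
            # Tokens not routed to expert e still give the router a signal
            # through the EMA estimate of what this expert would have produced.
            out[~routed] = out[~routed] + probs[~routed, e].unsqueeze(-1) * self.expert_ema[e]
        return out

x = torch.randn(8, 64)
print(TinyMoE()(x).shape)   # torch.Size([8, 64])

In a real training setup the EMA term mainly matters as a gradient path for the router, and a production MoE would also include load-balancing losses, expert capacity limits, and batched expert dispatch, none of which are shown in this sketch.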

Advisor: Michael Maire

Committee Members: Michael Maire, Risi Kondor, and Chenhao Tan

When unsubscribing, use your cnetid at cs.uchicago.edu address if your cnetid at uchicago.edu does not work.

cs mailing list  -  cs at mailman.cs.uchicago.edu
Edit Options and/or Unsubscribe: https://mailman.cs.uchicago.edu/mailman/listinfo/cs
More information here: https://howto.cs.uchicago.edu/techstaff:mailinglist
