[CS] Zain Sarwar MS Presentation (Jun 26, 2025)
via cs <cs at mailman.cs.uchicago.edu>
Thu Jun 12 10:00:27 CDT 2025
This is an announcement of Zain Sarwar's MS Presentation
===============================================
Candidate: Zain Sarwar
Date: Thursday, June 26, 2025
Time: 9:00 am CDT
Remote Location: https://uchicago.zoom.us/j/94869095059?pwd=DV4gGttLyWg6qZANktVJRZXV60BL1N.1
Title: Continual Pretraining, Dense Backpropagation, and Hierarchical Routing in Mixture of Experts
Abstract: Mixture-of-Experts (MoE) models have emerged as a powerful scaling strategy for large language models (LLMs), enabling sparse activation of parameters to achieve improved compute efficiency and performance. However, their adoption introduces a unique set of challenges across training stability, scaling dynamics, and continual learning. In this thesis, we present three contributions aimed at advancing the robustness and scalability of MoE models.
First, we explore the continual pretraining of MoE transformers, investigating whether sparse routing mechanisms hinder adaptation to new data. Our findings demonstrate that, with appropriate strategies, MoEs maintain their sample efficiency and expert balance across distribution shifts, offering practical alternatives to full model retraining. Second, we address a fundamental limitation of sparse learning: the router's exposure to only partial gradient signals. We introduce DefaultMoE, a method that approximates dense gradients via expert-wise exponential moving averages (EMAs), yielding improved pretraining efficiency without sacrificing sparsity. Third, we propose StructMoE, a hierarchical architecture that augments each expert with multiple low-rank submodules selected via a secondary router. This design introduces dynamic intra-expert routing, enabling structured parameter growth and improved expressivity. Empirical results demonstrate superior performance over standard MoEs.
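
For readers unfamiliar with the intra-expert routing idea, the short PyTorch sketch below illustrates one plausible form of a StructMoE-style expert: a standard FFN expert augmented with several low-rank submodules, one of which is selected per token by a secondary router. All class names, dimensions, and the top-1 secondary routing rule here are assumptions made for illustration only; they are not taken from the thesis itself.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LowRankSubmodule(nn.Module):
        # A rank-r bottleneck: down-project to rank r, then back up to d_model.
        def __init__(self, d_model, rank):
            super().__init__()
            self.down = nn.Linear(d_model, rank, bias=False)
            self.up = nn.Linear(rank, d_model, bias=False)

        def forward(self, x):
            return self.up(self.down(x))

    class HierarchicalExpert(nn.Module):
        # One expert of a sparse MoE layer, augmented with n_sub low-rank
        # submodules and a secondary (intra-expert) router that picks one
        # submodule per token and scales it by its routing probability.
        def __init__(self, d_model, d_ff, n_sub=4, rank=16):
            super().__init__()
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
            )
            self.submodules = nn.ModuleList(
                [LowRankSubmodule(d_model, rank) for _ in range(n_sub)]
            )
            self.sub_router = nn.Linear(d_model, n_sub)

        def forward(self, x):                              # x: (tokens, d_model)
            probs = F.softmax(self.sub_router(x), dim=-1)  # (tokens, n_sub)
            choice = probs.argmax(dim=-1)                  # top-1 submodule per token
            out = self.ffn(x)
            extra = x.new_zeros(x.shape)
            for s, sub in enumerate(self.submodules):
                mask = choice == s
                if mask.any():
                    # Weight the low-rank update by its routing probability so the
                    # secondary router still receives a gradient signal.
                    extra[mask] = probs[mask, s:s + 1] * sub(x[mask])
            return out + extra

    if __name__ == "__main__":
        expert = HierarchicalExpert(d_model=64, d_ff=256)
        tokens = torch.randn(10, 64)
        print(expert(tokens).shape)  # torch.Size([10, 64])

In a full model, each expert inside an ordinary top-k MoE layer would be replaced by such a hierarchical expert, so the primary router chooses experts while the secondary router chooses low-rank submodules within the chosen expert.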
Advisor: Michael Maire
Committee Members: Michael Maire, Risi Kondor, and Chenhao Tan