[Colloquium] Re: Greg Pauloski MS Presentation/Mar 23, 2022

Albert Sun sunalber at uchicago.edu
Tue Mar 22 11:09:59 CDT 2022


Please remove me from the list

________________________________
From: cs <cs-bounces+sunalber=cs.uchicago.edu at mailman.cs.uchicago.edu> on behalf of Megan Woodward <meganwoodward at uchicago.edu>
Sent: Tuesday, March 22, 2022 8:12:19 AM
To: cs at cs.uchicago.edu <cs at cs.uchicago.edu>; colloquium at cs.uchicago.edu <colloquium at cs.uchicago.edu>
Subject: [CS] Greg Pauloski MS Presentation/Mar 23, 2022

This is an announcement of Greg Pauloski's MS Presentation
===============================================
Candidate: Greg Pauloski

Date: Wednesday, March 23, 2022

Time: 3 pm CDT

Remote Location: https://uchicago.zoom.us/j/96526880582?pwd=V2pjQ3pvdFk3cWFNWWJyNW9SUnVDZz09

M.S. Paper Title: Scalable Deep Neural Network Training with Distributed K-FAC

Abstract: Scaling deep neural network training to more processors and larger batch sizes is key to reducing end-to-end training time; yet, maintaining comparable convergence and hardware utilization at larger scales is challenging. Increases in training scale have made natural gradient optimization methods a reasonable alternative to stochastic gradient descent (SGD) and its variants. Kronecker-factored Approximate Curvature (K-FAC), a natural gradient method, has recently been shown to converge in fewer iterations than SGD in deep neural network (DNN) training; however, K-FAC’s larger memory footprint and increased communication necessitate careful distribution of work for efficient use. This thesis investigates scalable K-FAC algorithms to understand K-FAC’s applicability in large-scale DNN training and presents KAISA, a K-FAC-enabled, Adaptable, Improved, and Scalable second-order optimizer framework. Specifically, layer-wise distribution strategies, inverse-free second-order gradient evaluation, dynamic K-FAC update decoupling, and more are explored with the goal of preserving convergence while minimizing training time. KAISA can adapt its memory footprint, communication, and computation to specific models and hardware to improve performance and increase scalability, and this adaptable distribution scheme generalizes existing strategies while providing a framework for scaling second-order methods beyond K-FAC. Compared to the original optimizers, KAISA converges 18.1–36.3% faster across applications with the same global batch size. Under a fixed memory budget, KAISA converges 32.5% and 41.6% faster on ResNet-50 and BERT-Large, respectively. KAISA can balance memory and communication to achieve scaling efficiency equal to or better than that of the baseline optimizers.
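
For readers unfamiliar with K-FAC, the short sketch below illustrates the basic Kronecker-factored preconditioning step for a single fully connected layer. It is a minimal NumPy illustration of the general technique, not the KAISA implementation; the function name and arguments (kfac_precondition, a, g, grad_w, damping) are invented for this example.

import numpy as np

def kfac_precondition(a, g, grad_w, damping=1e-3):
    """Return the K-FAC preconditioned gradient for one linear layer.

    a:      [batch, d_in]  layer inputs (activations)
    g:      [batch, d_out] gradients w.r.t. the layer's pre-activations
    grad_w: [d_out, d_in]  gradient of the loss w.r.t. the weight matrix
    """
    batch = a.shape[0]
    # Kronecker factors: input covariance A and output-gradient covariance G,
    # so the layer's Fisher block is approximated as A (kron) G.
    A = a.T @ a / batch                      # [d_in, d_in]
    G = g.T @ g / batch                      # [d_out, d_out]
    # Tikhonov damping keeps the factor inverses well conditioned.
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))
    # (A kron G)^-1 vec(grad_w) == vec(G^-1 @ grad_w @ A^-1) for a matrix gradient.
    return G_inv @ grad_w @ A_inv

The returned preconditioned gradient would replace the raw gradient in the usual weight update (e.g. W -= lr * kfac_precondition(a, g, grad_w)); the distribution strategies discussed in the thesis concern how the factor computation, inversion, and communication of such per-layer quantities are spread across workers.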

Advisors: Kyle Chard and Ian Foster

Committee Members: Kyle Chard, Ian Foster, and Zhao Zhang



