[Colloquium] Reminder - Yi He Candidacy Exam/Oct 13, 2021

meganwoodward at uchicago.edu
Wed Oct 13 08:57:32 CDT 2021


This is an announcement of Yi He's Candidacy Exam.
===============================================
Candidate: Yi He

Date: Wednesday, October 13, 2021

Time: 1 pm CDT

Remote Location: https://uchicago.zoom.us/j/98453897294?pwd=KzFsaVF3bk5ZSWtpTVB2cFYxa2grQT09

Location: Crerar 298

Title: Hardware Errors or Software Bugs? Debugging Unexpected Outcomes in Deep Neural Network Training Workloads

Abstract: Debugging unexpected outcomes in deep neural network (DNN) training workloads is a critical challenge. Given the massive complexity of advanced DNNs and the large number of training devices, DNN training workloads are susceptible not only to software bugs but also to hardware errors, which have already been observed in real industry workloads. However, no prior work thoroughly studies the impact of hardware errors on DNN training workloads or offers solutions to counteract it, so there is an urgent need to fill this gap.

In this work, our first goal is to accurately and comprehensively characterize the behaviors of hardware errors. We achieve this goal by leveraging our previous work to create accurate software fault models for hardware errors, and by performing a large-scale fault injection study using these software fault models. Our initial results show that hardware errors can lead to a wide range of unexpected DNN training outcomes, many of which overlap with those caused by software bugs.

Therefore, the crucial first step in debugging unexpected DNN training outcomes successfully and efficiently is to determine whether the root cause is a hardware error or a software bug. To this end, the second goal of this work is to devise efficient techniques to identify the root cause (hardware errors or software bugs) of these unexpected outcomes, so that manual debugging effort is spent only on software bugs. We will investigate various techniques across different system abstraction layers (spanning algorithm, software, run-time, and hardware architecture), so as to achieve the best trade-off between the accuracy of root-cause identification and the performance/energy overhead.

Advisor: Yanjing Li

Committee: Shan Lu, Michael Maire, and Yanjing Li


