[Colloquium] Reminder - Yi He Dissertation Defense/Jan 10, 2023

Megan Woodward meganwoodward at uchicago.edu
Tue Jan 10 08:35:12 CST 2023


This is an announcement of Yi He's Dissertation Defense.
===============================================
Candidate: Yi He

Date: Tuesday, January 10, 2023

Time: 10 am CST

Location: JCL 298

Title: Resilient Deep Learning Accelerators

Abstract: Deep learning (DL) accelerators have been widely deployed across a broad range of application domains, from edge computing and self-driving cars to cloud services. Resilience against hardware failures is a top priority for these accelerators, as such failures can lead to various undesirable consequences. For inference workloads, hardware failures can cause crashes and significant degradation in inference accuracy. For training workloads, they can result in failure to converge, low training/test accuracy, numerical errors (e.g., INFs/NaNs), and crashes.

In this talk, we will discuss how major classes of hardware failures propagate through and affect DL inference and training workloads, and present new detection and recovery techniques to mitigate them. First, we will introduce a novel technique that generates high-quality test programs, specifically targeting DL inference accelerators, to detect permanent hardware failures (e.g., early-life failures, circuit aging) in the field. We demonstrate the efficacy of our technique using Nvidia's open-source accelerator, NVDLA, and show that it meets high reliability requirements, even those mandated by safety-critical applications such as self-driving cars. Specifically, our technique detects >99.0% of stuck-at and transition faults (two representative fault models for permanent hardware failures) for the entire NVDLA design, compared to <80% when random test programs are used. Moreover, our technique requires minimal test time and test storage, and no hardware modification.
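To make the fault-coverage numbers above concrete, here is a minimal, hypothetical sketch of how stuck-at fault coverage is typically evaluated (a toy 8-bit MAC unit, not the dissertation's actual method or the real NVDLA design): a test vector detects a fault if the fault-free and faulty outputs of the unit differ, and coverage is the fraction of modeled faults detected by the test set.

```python
def mac(a, b, acc, stuck_bit=None):
    """Toy 8-bit multiply-accumulate with a 16-bit output.
    If stuck_bit is given, force that output bit stuck-at-0."""
    out = (acc + a * b) & 0xFFFF
    if stuck_bit is not None:
        out &= ~(1 << stuck_bit)  # model the stuck-at-0 fault
    return out

def detects(test_vectors, stuck_bit):
    """A test set detects a fault iff some vector's faulty output
    differs from its fault-free output."""
    return any(mac(a, b, c) != mac(a, b, c, stuck_bit)
               for a, b, c in test_vectors)

def coverage(test_vectors, n_bits=16):
    """Fraction of single stuck-at-0 output faults the test set detects."""
    detected = sum(detects(test_vectors, b) for b in range(n_bits))
    return detected / n_bits
```

In this toy model, the vector (255, 255, 0) produces output 0xFE01, exercising 8 of the 16 output bits, so it alone achieves 50% coverage; generating test programs that push coverage above 99% on a full accelerator design is the hard problem the talk addresses.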

Second, we will discuss FIdelity, our open-source resilience analysis framework targeting transient hardware failures (e.g., soft errors, dynamic variations) in both the logic and memory components of DL inference accelerators, along with the first large-scale study of transient hardware failures in the logic components of DL inference accelerators, conducted using this framework. FIdelity enables accurate and fast analysis by leveraging unique architectural properties of DL accelerators to model hardware failures in software with high fidelity. We implement and thoroughly validate FIdelity using NVDLA as the baseline DL accelerator. Using FIdelity, we perform 46M fault injection experiments running a variety of representative deep neural network inference workloads. We thoroughly analyze the results, develop a fundamental understanding of how the injected faults propagate, and obtain several new insights that can guide the design of efficient, resilient DL inference accelerators.

Lastly, we will focus on DL training workloads and present (1) the first study to provide a fundamental understanding of how hardware failures affect DL training workloads, and (2) new, lightweight hardware failure mitigation techniques. We extend the FIdelity framework to perform large-scale fault injection experiments targeting both transient and permanent hardware failures in DL training workloads, conducting >2.9M experiments with a diverse set of training workloads. We characterize the outcomes of these experiments, thoroughly analyze the fault propagation paths, and derive the necessary conditions that must be satisfied for hardware failures to cause unexpected training outcomes. Based on these necessary conditions, we develop ultra-lightweight software techniques to detect hardware failures and recover the affected workloads; these techniques require only 24-32 lines of code changes and introduce 0.003-0.025% performance overhead for various representative neural networks. Our findings are validated by observations in real datacenter workloads, and our techniques can be readily deployed in practice.
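As one illustration of what an ultra-lightweight, training-loop-level detector might look like (the specific check, window size, and threshold below are my assumptions, not the dissertation's actual technique), a few lines of code can catch the INF/NaN and loss-spike symptoms mentioned above and signal the loop to restore from the last checkpoint:

```python
import math

def check_loss(loss, history, window=10, threshold=10.0):
    """Return False if this step's loss looks corrupted and the loop
    should roll back to the last checkpoint; True otherwise."""
    if math.isnan(loss) or math.isinf(loss):
        return False  # numerical error: a classic hardware-fault symptom
    if history and loss > threshold * (sum(history) / len(history)):
        return False  # sudden spike relative to the recent loss average
    history.append(loss)  # only healthy losses enter the window
    if len(history) > window:
        history.pop(0)
    return True
```

Because the check runs once per training step and touches only a scalar, its overhead is negligible next to the forward/backward passes, which is the spirit of the sub-0.1% overhead figures quoted above.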

Advisor: Yanjing Li

Committee members: Yanjing Li, Shan Lu, and Michael Maire
