[CS] REMINDER: Meng Wang Dissertation Defense/Oct 28th

via cs cs at mailman.cs.uchicago.edu
Mon Oct 28 11:38:56 CDT 2024


This is an announcement of Meng Wang's Dissertation Defense.
===============================================
Candidate: Meng Wang

Date: Monday, October 28, 2024

Time: 3:30pm – 5:30pm CT

Location: JCL 390

Remote Location: https://uchicago.zoom.us/j/94589925163?pwd=AXfDKRnKw4Q2ljZoz4DJhp47CNtbIh.1



Title: Multi-Level Erasure Coded Storage Design and Its Relationship to Deep Learning Workloads

Abstract: Large-scale data centers store vast amounts of user data across a large number of disks, necessitating redundancy mechanisms such as erasure coding (EC) to protect against disk failures. As storage systems scale in size, complexity, and layering, the frequency of disk failures increases, and the time required to rebuild failed disks grows longer. For managing tens or hundreds of thousands of disks, traditional single-level erasure coding (SLEC) does not scale well, as it struggles to balance repair overhead with rack- and enclosure-level failure tolerance. Multi-level erasure coding (MLEC), which applies EC at both network and local levels, has been deployed in large-scale systems. However, there has been no in-depth study of its design considerations at scale, and many key research questions remain unaddressed. This dissertation aims to provide a comprehensive analysis of MLEC at scale, with a focus on its design considerations and relationship to deep learning (DL) workloads.
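As a rough, hypothetical illustration of the scheme the abstract describes (the parameters below are our own examples, not from the dissertation): MLEC nests a network-level code over local-level codes, so the storage overheads of the two levels multiply.

```python
# Hypothetical illustration of SLEC vs. MLEC storage overhead.
# All code parameters here are example values, not the dissertation's.

def slec_overhead(k, p):
    """Raw bytes stored per user byte for a single-level (k+p, k) code."""
    return (k + p) / k

def mlec_overhead(k_net, p_net, k_loc, p_loc):
    """MLEC applies EC at the network level and again at the local level,
    so the per-level overheads multiply."""
    return slec_overhead(k_net, p_net) * slec_overhead(k_loc, p_loc)

# Example: a network-level 8+2 code over local 10+1 codes
# gives (10/8) * (11/10) = 1.375x overhead, vs. 1.2x for SLEC 10+2.
print(round(mlec_overhead(8, 2, 10, 1), 3))  # 1.375
print(round(slec_overhead(10, 2), 3))        # 1.2
```

The extra overhead buys two independent failure domains: local parity repairs disk failures within an enclosure, while network parity tolerates rack- or enclosure-level losses.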

We begin by presenting a detailed analysis of MLEC's design space, considering multiple dimensions including various code parameter selections, chunk placement schemes, and repair methods. We quantify their performance and durability, and show which MLEC schemes and repair methods provide the best tolerance against independent and correlated failures while reducing repair network traffic by orders of magnitude. To achieve this, we use various evaluation strategies including simulation, splitting, dynamic programming, and mathematical modeling. We also compare the performance and durability of MLEC with other EC schemes such as SLEC and LRC, and show that MLEC can provide high durability with higher encoding throughput and less repair network traffic than both SLEC and LRC.

We then discuss the relationship between MLEC and DL workloads. As DL workloads become increasingly data-intensive, their training datasets often exceed local storage capacity, necessitating access to remote erasure-coded storage systems. To cost-effectively evaluate MLEC's ability to meet the throughput demands of DL workloads, we develop an emulation-based approach. We introduce GPEmu, a GPU emulator designed to facilitate efficient prototyping and evaluation of deep learning systems without the need for physical GPUs. GPEmu supports over 30 DL models and 6 GPU models, providing capabilities for time emulation, memory emulation, distributed system support, and GPU sharing. We also develop MLECEmu, an emulator that simulates the read throughput of erasure-coded disk arrays with I/O-throttled in-memory file systems. Leveraging these tools, we demonstrate that MLEC storage can improve GPU utilization through wider stripes and that our optimized MLEC repair methods can reduce the duration of performance degradation caused by disk failures.

While MLEC storage provides high aggregated intra-cluster read throughput for DL workloads, the network bandwidth between the GPU cluster and the MLEC storage cluster can become a bottleneck during deep learning training, as inter-cluster bandwidth is typically more constrained than intra-cluster bandwidth. Recognizing that many samples significantly reduce in size during data preprocessing, we explore the selective offloading of preprocessing tasks to remote MLEC storage to mitigate data traffic. We conduct a case study to evaluate the potential benefits and challenges of this approach. Based on our findings, we propose SOPHON, a framework that selectively offloads preprocessing tasks at a fine granularity to reduce data traffic. SOPHON uses online profiling and adaptive algorithms to optimize the preprocessing of every sample in each training scenario. Evaluations using GPEmu and MLECEmu show that SOPHON reduces data traffic and training time by 1.2x to 2.2x compared to existing solutions.

Advisor: Haryadi Gunawi

Committee Members: Haryadi Gunawi, Junchen Jiang, John Bent, and Anjus George
