[CS] Reminder: [defense] Ke/Dissertation Defense/Nov 13, 2020

Tricia Baclawski pbaclawski at uchicago.edu
Thu Nov 12 11:19:11 CST 2020


https://uchicago.zoom.us/j/5458676655?pwd=MWJZYXIyZVJNeTlodmNRTksyaXh2QT09
Password: 894465

       Department of Computer Science/The University of Chicago

                     *** Dissertation Defense ***


Candidate:  Huan Ke

Date:  Friday, November 13, 2020

Time:  1:00 PM

Place:  via zoom

Title: Data Survival under Catastrophic Failures

Abstract:
As we look toward exascale it is clear that high-capacity HPC storage
systems will incorporate the large populations of hard disk drives
that have previously only been deployed at cloud-level service
providers. Further, with the rapid increase in network performance,
the number of disks per storage server will need to be dramatically
increased to efficiently pair with current networking technology. With
the massive populations of disks integrated within local systems, the
probability of various correlated failures across a large number of
components becomes a critical concern in preventing data loss. To
guarantee data survival under catastrophic failures, this
dissertation explores the system reliability perspective to empower
data protection schemes with robustness, flexibility, and hierarchy.

The first part of this dissertation strengthens existing data
protection schemes with higher fault tolerance. We present a novel
declustered parity scheme, single-overlap declustered parity (SODP),
that ensures at most one overlapping disk between any two stripesets. This
maximizes the number of simultaneous disk failures tolerated without
increasing parity overhead and minimizes disk rebuild time by
balancing parity stripes across disks. Rather than making a trade-off
between fault tolerance and rebuild performance, SODP takes the first
step toward achieving both high fault tolerance and high rebuild
performance. Our evaluation shows that, compared to the state of the
art, SODP can reduce the probability of data loss during failure
bursts by 30x and achieve similar data protection using only half as
much parity overhead.
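
The single-overlap property described above can be illustrated with a
short check. This is only a sketch, not the construction from the
dissertation: it assumes a stripeset can be modeled as a set of disk IDs,
and the function names and both example layouts (a 7-disk layout based on
the Fano plane, and a simple counterexample) are hypothetical.

```python
from itertools import combinations


def max_pairwise_overlap(stripesets):
    """Return the largest number of disks shared by any two stripesets.

    Each stripeset is modeled (as an assumption) as a collection of disk IDs.
    """
    sets = [frozenset(s) for s in stripesets]
    return max((len(a & b) for a, b in combinations(sets, 2)), default=0)


def is_single_overlap(stripesets):
    """True if any two stripesets share at most one disk."""
    return max_pairwise_overlap(stripesets) <= 1


# Hypothetical 7-disk layout taken from the Fano plane: every pair of
# stripesets shares exactly one disk.
FANO_LAYOUT = [
    (0, 1, 2), (0, 3, 4), (0, 5, 6),
    (1, 3, 5), (1, 4, 6),
    (2, 3, 6), (2, 4, 5),
]

# Hypothetical counterexample: these two stripesets share disks 1 and 2.
TWO_OVERLAP_LAYOUT = [(0, 1, 2), (1, 2, 3)]

if __name__ == "__main__":
    print(is_single_overlap(FANO_LAYOUT))
    print(is_single_overlap(TWO_OVERLAP_LAYOUT))
```

In a layout that passes this check, two disks failing together can both
belong to at most one stripeset, which gives some intuition for why
limiting overlap helps during failure bursts.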

The second part of this dissertation provides the flexibility to
understand how the interactions between fault tolerance and rebuild
performance together impact system reliability. We design a practical
and flexible tool, fractional-overlap declustered parity (FODP), to
explore the trade-offs between the number of failure domains and
rebuild performance. This gives us fine-grained control to
accommodate different reliability requirements and system sizes.
Furthermore, we introduce FODP-Plus-One to add additional parity on
top of the FODP data layout, which uniformly distributes data and
parity blocks across disks and maps the given logical units onto the
specified physical disks. Our detailed analysis shows that FODP can
reduce the probability of data loss by up to 99% under various failure
regimes, and FODP-Plus-One yields up to a 99% reduction in the
granularity of data loss.

The third part of this dissertation further explores how SODP/FODP
can be integrated into tiered parity, which layers two levels of
protection schemes on top of one another. With the tiered
architecture, storage architects are able to protect against new types
of failures, such as rack failures or power and cooling failures,
under which existing flat parity schemes would definitely lose data.
However, existing tiered parity schemes work at two extremes, either
enhancing rebuild performance or improving fault tolerance, and
few established principles exist to guide system designs that tolerate
both temporally and spatially correlated failures. This work
systematically explores the design space for balancing fault tolerance
and rebuild performance at each tier and evaluates how different data
protection techniques impact the probability of data loss under a
variety of failure regimes. Based on the analysis, we identify a set
of design principles that storage architects can use to tolerate
correlated failures. By applying these principles, we present a novel
tiered parity scheme, Tiered FODP (TFODP), where the top tier is
deployed with the minimal FODP technique for high fault tolerance and
the bottom tier is designed with the maximal FODP to provide high
rebuild performance. Our evaluation shows that TFODP can achieve
higher reliability with much lower storage overhead.

Huan's advisor is Prof. Haryadi Gunawi

Log in to the Computer Science Department website for details,
including a draft copy of the dissertation:

 https://newtraell.cs.uchicago.edu/phd/phd_announcements#huanke

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Tricia Baclawski
Student Affairs Administrator
Computer Science Department
5730 S. Ellis - Room 350
Chicago, IL 60637
pbaclawski at uchicago.edu
(773) 702-6854
/pronouns: she, her, hers/
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

