[Colloquium] Suminto/Dissertation Defense/Oct 22, 2019

Margaret Jaffey margaret at cs.uchicago.edu
Tue Oct 8 10:22:54 CDT 2019



       Department of Computer Science/The University of Chicago

                     *** Dissertation Defense ***


Candidate:  Riza Suminto

Date:  Tuesday, October 22, 2019

Time:  11:30 AM

Place:  John Crerar Library (JCL) 298

Title: MITIGATING CASCADING PERFORMANCE FAILURES AND OUTAGES IN CLOUD
SYSTEMS

Abstract:
Modern distributed systems ("cloud systems") have emerged as a
dominant backbone for many of today's applications. As these systems
collectively become the "cloud operating system", users expect high
dependability, including performance stability and availability. Small
jitter in system performance or minutes of service downtime can have
a huge impact on a company and on user satisfaction. In this
dissertation, we tackle these challenges. We try to improve cloud
system dependability by mitigating disruptive cascading effects on
performance stability and availability.

For the performance reliability aspect, we focus on mitigating
cascading performance failures by improving the tail tolerance of
data-parallel frameworks. One popular solution to the tail latency
problem is speculative execution (SE). Existing SE implementations,
such as those in Hadoop and Spark, are considered quite robust.
However, we found an important source of tail latencies that current
SE implementations cannot handle gracefully: node-level network
throughput degradation. We reveal the loopholes of current SE
implementations under this unique fault model, and show how the
problem can cascade to the entire cluster. We then address the problem
using PBSE, a robust, path-based speculative execution technique that
employs three key ingredients: path progress, path diversity, and
path-straggler detection and speculation.

For the availability aspect, we try to improve cloud system
availability by detecting and eliminating cascading outage bugs (CO
bugs). A CO bug is a bug that can cause simultaneous or cascading
failures in each of the individual nodes in the system, which
eventually leads to a major outage. While hardware arguably is no
longer a single point of failure, our large-scale studies of cloud
bugs and outages reveal that CO bugs have emerged as a new class of
outage-causing bugs and a single point of failure in the software. We
address the CO bug problem with the Cascading Outage Bugs Elimination
(COBE) project. In this project, we: (1) study the anatomy of CO bugs,
and (2) develop CO-bug detection tools to unearth CO bugs.

Riza's advisor is Prof. Haryadi Gunawi.

Log in to the Computer Science Department website for details,
including a draft copy of the dissertation:

 https://newtraell.cs.uchicago.edu/phd/phd_announcements#riza

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Margaret P. Jaffey            margaret at cs.uchicago.edu
Department of Computer Science
Student Support Rep (Ry 156)               (773) 702-6011
The University of Chicago      http://www.cs.uchicago.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

