[Colloquium] Reminder: Leesatapornwongsa/Dissertation Defense/Jun 6, 2017

Mon Jun 5 11:25:41 CDT 2017

This is a reminder about Tanakorn's defense tomorrow.

       Department of Computer Science/The University of Chicago

                     *** Dissertation Defense ***

Candidate:  Tanakorn Leesatapornwongsa

Date:  Tuesday, June 6, 2017

Time:  1:00 PM

Place:  Ryerson 277

Title: Unearthing concurrency and scalability bugs in cloud-scale
distributed systems

Abstract:
In the era of cloud computing, users (normal and organization) move
their data and computation from local machines and internal
datacenters to cloud services, thus the services are expected to be
24/7 dependable. Cloud services must be accessible anytime and
anywhere and not lose or corrupt users data, and scale as user base
continues to grow. Unfortunately, guaranteeing cloud services’
dependability is challenging because these cloud services are backed
by large sophisticated distributed systems such as scalable data
stores, data-parallel frameworks, and cluster management systems. Such
cloud-scale distributed systems remain difficult to get right because
they need to address data races among nodes, complex failures in
commodity hardware, tremendous user requests, and much more.
Addressing these cloud-specific challenges makes the systems getting
more complex and new intricate bugs continue to create dependability
problems. This dissertation tries to answer a vital question of cloud
dependability: “how can we make cloud-scale distributed systems
reaching ideal dependability?” We try to answer this question by
focusing on the problems of distributed concurrency bugs and
scalability bugs. We focus on these two problems because they are
novel issues that occur in cloud environment only and not many works
addressing them. Distributed concurrency bug (DC bug) is one of
unsolved reliability problem in cloud systems. DC bugs are caused by
non-deterministic order of distributed events such as message
arrivals, machine crashes, and reboots. Cloud systems execute multiple
complicated distributed protocols concurrently. The possible
interleavings of the distributed events are beyond developer’s
anticipations and some interleavings might not be handled properly
that can lead to catastrophic failures. To combat DC bugs, we make two
contributions. First, we advance state of the art of distributed
system model checking by introducing “semantic-aware model checking”
(SAMC). Distributed system model checkers (dmck) are used to test
system reliability of real systems. Existing dmcks however rarely
exercise multiple faults due to the state-space explosion problem, and
thus do not address present reliability challenges of cloud systems in
dealing with complex faults. SAMC pushes the boundary of dmcks by
introducing a white-box principle that takes simple semantic
information of the target system and incorporates that knowledge into
state-space reduction policies. We show that SAMC can find deep bugs
one to two orders of magnitude faster compared to state-of-the-art
techniques. Second, we conduct a formal study on DC bugs to gain
foundation knowledge for future DC bug research. We study 104 DC bugs
from various widely-deployed cloud-scale distributed systems in many
characteristics along several axes of analysis such as the triggering
timing condition and input preconditions, error and failure symptoms,
and fix strategies. And we present the first complete taxonomy of DC
bugs, TaxDC. And for the second aspect of system dependability, we
focus on scalability bugs. Scale surpasses the limit of a single
machine in meeting users’ increasing demands for computing and
storage. On the negative side, scale creates new development and
deployment issues. Developers must ensure that their algorithms and
protocol designs to be scalable. However, until real deployment takes
place, unexpected bugs in the actual implementations are unforeseen.
This new era of cloud-scale distributed systems has given birth to
“scalability bugs”, latent bugs that are scale-dependent, and only
surface in large scale. To address scalability bugs, we conduct a
study on scalability bugs to understand how they manifest and what
their root causes are and introduce SCk, a methodology that enables
developers to scale-check distributed systems and finds scalability
bugs economically on one machine. SCk helps developers identify
potential buggy code and allows developers to colocate a large number
of nodes to test the potential buggy code without sacrificing
accuracy. We remove a problem of hardware contentions (i.e. CPU,
memory, and thread) with four novel strategies, and we successfully
integrate SCk to Cassandra, HDFS, Riak, and Voldemort. With SCk, we
achieve a high colocation factor (500 nodes) and can reproduce ten
scalability bugs and identify two new hidden bugs.

Tanakorn's advisor is Prof. Haryadi Gunawi

Login to the Computer Science Department website for details,
including a draft copy of the dissertation:

 https://www.cs.uchicago.edu/phd/phd_announcements#tanakorn

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Margaret P. Jaffey            margaret at cs.uchicago.edu
Department of Computer Science
Student Support Rep (Ry 156)               (773) 702-6011
The University of Chicago      http://www.cs.uchicago.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=