[Colloquium] Liu/Dissertation Defense/Jun 5, 2019

Margaret Jaffey margaret at cs.uchicago.edu
Wed May 22 15:36:36 CDT 2019



       Department of Computer Science/The University of Chicago

                     *** Dissertation Defense ***


Candidate:  Haopeng Liu

Date:  Wednesday, June 5, 2019

Time:  1:00 PM

Place:  John Crerar Library (JCL) 298

Title: MODELING AND TACKLING TIMING BUG IN MULTI-THREADED SYSTEMS AND
DISTRIBUTED SYSTEMS

Abstract:
Multi-threaded software and distributed cloud software are prevalent
as a dominant backbone for modern applications. Although it is
extremely important, their reliability is severely threatened by
software bugs. Among all types of software bugs, timing bugs are among
the most troublesome to tackle due to the inherent non-deterministic
nature and the huge interleaving space. To explore all large numbers
of possible interleavings in this space, models of timing bugs are
critical and a timing bug model includes five components:
nondeterminism, causality relationship, synchronization, shared data,
and bug pattern. To fight timing bugs and improve concurrent software
reliability, this dissertation focuses on three aspects of timing bug
models. (1) non-determinism: existing models focus on thread
interleavings in multi-threaded programs. (2) synchronization:
existing models mainly look into lock-related synchronization
primitives like lock and condition variable. (3) shared data: existing
models mainly target shared global memory. However, existing models
cannot cover new situation of timing bugs and guide the fighting.
Specifically, (1) non-determinism: message interleaving and random
hardware faults are new type of nondeterminism in distributed systems
beyond traditional thread interleaving. Meanwhile, these new types of
non-determinism introduces new causality relationship. How to
precisely model them are challenging and urgent. (2) synchronization:
in the manually designed patches of timing bugs in multi-threaded
programs, it's a common strategy to leverage non-lock synchronization
primitives to enforce the timing relationship and x bugs. How to model
this strategy and design new xing tools are desired. (3) shared data:
Whether shared global memory is single dominant data type that timing
bugs are racing on? If not, what are new types of shared data? Answers
to these questions are critical to improve the software system
reliability. This dissertation works on these three directions and
makes the following contributions: First, in the non-determinism
direction, this thesis proposes a new model for message timing bugs
and fault timing bugs in distributed systems. Different from
traditional timing bugs in multi-threaded systems, they are caused by
the non-determinism of message interleaving and random hardware
faults. Our work models these two new types of non-determinism and the
new causality relationship introduced by them. Guided by the proposed
models, detection tools are designed to predict message timing bugs
and fault timing bugs from correct runs. Each step of the detection
tool is carefully customized to address the unique challenges for
timing bugs in distributed systems. The evaluation result shows that
our tool can effectively and efficiently detect message timing bugs
and fault timing bugs with low false positive rates. Second, in the
synchronization direction, we first conducts an empirical study of
manual patches for real-world timing bugs in multi-threaded programs
to understand the gap between automatically generated patches and
manually generated patches. The study finds that (a) lock is the
dominant synchronization primitives for enforcing atomicity;
lock-related signals/waits are not the dominant primitive for
enforcing pairwise ordering in patches. (b) leveraging existing
synchronization in software is as common as adding extra primitives.
Motivated by these findings, we design a xing tool to model and
enforce timing relationship by leveraging non-lock synchronization
primitives. The patch generated by our tool has matching quality with
manual patches. Third, in the shared data direction, this dissertation
conducts a comprehensive characteristic study on real-world incidents
in Microsoft Azure production-run cloud services to provide an
in-depth understanding what types of timing bugs escaped in-house
testing and how were they resolved in the cloud. The study reveals two
main fi ndings: (1) half of timing bugs in our study are racing on
persistent data instead of shared global memory variables; (2)
mitigation strategy, especially running-environment mitigation, is
widely used to resolve timing bug incidents in the cloud. These
findings provide a guidance for future efforts in this field.

Haopeng's advisor is Prof. Shan Lu

Login to the Computer Science Department website for details,
including a draft copy of the dissertation:

 https://newtraell.cs.uchicago.edu/phd/phd_announcements#haopliu

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Margaret P. Jaffey            margaret at cs.uchicago.edu
Department of Computer Science
Student Support Rep (Ry 156)               (773) 702-6011
The University of Chicago      http://www.cs.uchicago.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


More information about the Colloquium mailing list