[Colloquium] Reminder: Fang/Dissertation Defense/Jul 3, 2018

Margaret Jaffey via Colloquium colloquium at mailman.cs.uchicago.edu
Mon Jul 2 09:17:55 CDT 2018


This is a reminder about Aiman's defense tomorrow.

       Department of Computer Science/The University of Chicago

                     *** Dissertation Defense ***


Candidate:  Aiman Fang

Date:  Tuesday, July 3, 2018

Time:  10:00 AM

Place:  Ryerson 255

Title: APPLICATION-BASED FOCUSED RECOVERY (ABFR): CONVENIENT
MANAGEMENT OF LATENT ERROR RESILIENCE USING APPLICATION KNOWLEDGE

Abstract:
Supercomputers continue to increase in scale and complexity to meet
the demands of science and engineering. Extreme-scale systems face
high error rates due to increasing scale (10^9 cores), software
complexity and rising memory error rates. Increasingly, errors escape
immediate hardware-level detection, silently corrupting application
states. Such latent errors can often be detected by application-level
tests but typically at long latencies. Challenges for latent errors
include determining when the error occurred, what data was corrupted,
and how to recover efficiently. The predicted high error rates and
latent errors are a critical problem that will increase the cost and
may limit the scale of application science.

We propose a new approach called Application-Based Focused Recovery
(ABFR) that exploits application knowledge to focus the recovery on
only potentially corrupted data, achieving efficient latent error
recovery. ABFR is an application-system partnership for efficient
latent error tolerance. It first defines the application knowledge
needed for efficient latent error recovery, including identifying the
potential root causes of the error and focusing recovery
intelligently. This allows the application to pursue strategies
exploiting a range of application semantics within a well-defined
resilience framework. The ABFR runtime then exploits this knowledge to
achieve efficient latent error tolerance. ABFR enables application
designers to express resilience without concern for the underlying
architectures and systems. It provides a clear separation between
application knowledge and underlying systems. Together, these ABFR
properties support flexible application-based resilience.

To demonstrate its generality, we apply ABFR to three varied
scientific computation archetypes (stencil, N-Body tree, and Monte
Carlo particle transport). We design ABFR operators for each
computation and evaluate the performance of ABFR. We measure latent
error resilience performance for varied error rates. Results indicate
ABFR significantly improves recovery performance. Specifically, ABFR
reduces error recovery cost by 2.4x to 367x, recovery latency by 2.2x
to 24x, and I/O cost by up to 1000x. ABFR achieves efficient and scalable
recovery at scale with high latent error rates for all three
computations.

This thesis demonstrates a new approach for efficient, scalable latent
error recovery on large-scale systems. ABFR enables flexible
application-based error resilience and provides sophisticated runtime
support. As a result, applications are able to tolerate 1000-fold
higher error rates.

Aiman's advisor is Prof. Andrew Chien.

Log in to the Computer Science Department website for details,
including a draft copy of the dissertation:

 https://www.cs.uchicago.edu/phd/phd_announcements#aimanf

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Margaret P. Jaffey            margaret at cs.uchicago.edu
Department of Computer Science
Student Support Rep (Ry 156)               (773) 702-6011
The University of Chicago      http://www.cs.uchicago.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

