[Colloquium] Fang/Dissertation Defense/Jul 3, 2018

Margaret Jaffey via Colloquium colloquium at mailman.cs.uchicago.edu
Tue Jun 19 09:37:39 CDT 2018



       Department of Computer Science/The University of Chicago

                     *** Dissertation Defense ***


Candidate:  Aiman Fang

Date:  Tuesday, July 3, 2018

Time:  10:00 AM

Place:  Ryerson 255

Title: ABFR: CONVENIENT MANAGEMENT OF LATENT ERROR RESILIENCE USING
APPLICATION KNOWLEDGE

Abstract:
Supercomputers continue to increase in scale and complexity to meet
the demands of science and engineering. Exascale systems face high
error rates due to increasing scale (10^9 cores), software complexity
and rising memory error rates. Increasingly, errors escape immediate
hardware-level detection, silently corrupting application states. Such
latent errors can often be detected by application-level tests but
typically at long latencies. Challenges for latent errors include
determining when the error occurred, what data was corrupted, and how
to efficiently recover. The predicted high error rates and latent
errors are a critical problem that will limit the scale of application
science.

We propose a new approach, Application-Based Focused Recovery
(ABFR), that defines the application knowledge needed for efficient
latent error recovery. This allows the application to pursue
strategies exploiting a range of application semantics within a
well-defined resilience framework. The ABFR runtime then exploits this
knowledge to achieve efficient latent error tolerance. ABFR enables
application designers to express resilience without concern for the
underlying architectures and systems. Together, these ABFR properties
support flexible application-based resilience.

To demonstrate its generality, we apply ABFR to three varied
scientific computation archetypes (stencil, N-Body tree, and Monte
Carlo particle transport). We design ABFR operators for each
computation and evaluate the performance of ABFR. We measure latent
error resilience performance for varied error rates; results indicate
significant reductions in error recovery cost (up to 367x) and
recovery latency (up to 24x). ABFR achieves efficient, scalable
recovery under high latent error rates for all three
computations.

Aiman's advisor is Prof. Andrew Chien.

Log in to the Computer Science Department website for details,
including a draft copy of the dissertation:

 https://www.cs.uchicago.edu/phd/phd_announcements#aimanf

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Margaret P. Jaffey            margaret at cs.uchicago.edu
Department of Computer Science
Student Support Rep (Ry 156)               (773) 702-6011
The University of Chicago      http://www.cs.uchicago.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

