[Colloquium] Rubenstein/MS Presentation/Feb 13, 2014

Margaret Jaffey margaret at cs.uchicago.edu
Thu Jan 30 11:33:14 CST 2014


This is an announcement of Zachary Rubenstein's MS Presentation.

------------------------------------------------------------------------------
Date:  Thursday, February 13, 2014

Time:  1:30 PM

Place:  Ryerson 276

M.S. Candidate:  Zachary Rubenstein

M.S. Paper Title: Error Checking and Snapshot-Based Recovery in a
Preconditioned Conjugate Gradient Solver

Abstract:
With some probability per unit time, each component in a computer system
can experience an intermittent hardware error, or soft error, due to a
number of factors, such as radiation or circuit noise. Soft errors can
cause unexpected behavior, which in turn can lead to sporadic,
incorrect results in computation and incorrect values in memory.

Current high-performance systems experience these errors at a rate of
perhaps one fault every few days. As high-performance systems grow
from petascale computation to exascale computation, they will be
composed of an increasing number of components, and each of these
components will be less reliable. The combination of these two factors
can lead to a dramatic increase in the number of soft errors per unit
time in a system, to perhaps one error per hour. In order to generate correct
results, and, in many cases, to generate results at all, these future
systems will have to correct the erroneous values generated by an
increasingly large number of errors.

Current approaches to dealing with soft errors include using more
resilient components and checkpoint/restart (CR), a family of
protocols that preserve state. If a system utilizes components that
are, at some additional manufacturing cost, less likely to fail than
cheaper components, the system will, on average, experience errors
less often. However, it is physically impossible to manufacture an
error-free component, and, with an increase in scale and increasing
energy demands, even systems made of more resilient components will
eventually succumb to high error rates. In CR, critical elements of
system state are periodically preserved onto a more stable medium. CR
does compensate for some errors, but it has two major drawbacks.
First, taking a checkpoint requires interrupting forward progress for
a certain amount of time. If the failure rate is sufficiently high, the
frequency of required checkpoints will preclude forward progress.
Second, it is possible for a soft error to go undetected and corrupt
system state without invoking a restart. We call these escaped errors
(EE), and CR systems have no way to deal with them.

Due to the shortcomings of current approaches, it will likely be
necessary to find more nuanced approaches to dealing with soft errors
in the future. Due to the difficulty of detecting soft errors and the
difficulty in recovering from soft errors quickly, it is likely that
these new approaches will need to be tailored to specific
applications.

We apply the Global View Resilience (GVR) library to evaluate
approaches for tolerating soft errors in a preconditioned conjugate
gradient (PCG) solver, an important kernel and an exemplar for a wide
range of scientific applications. We use our exploration to evaluate
these approaches in terms of runtime and likelihood of EE for PCG. We
also evaluate the utility of GVR in augmenting our solver with
algorithms that tolerate soft errors.

To detect errors, we utilize two primary classes of solvers.
Residual-based solvers attempt to deduce errors based exclusively on
the historical distance between PCG's current approximation of the
correct answer and the actual correct answer. Algorithm-based solvers
perform extra linear algebra operations to verify certain mathematical
invariants particular to PCG.
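
As a rough illustration of the two detector classes described above,
the following Python sketch shows one possible check of each kind. It
is not taken from the paper: the function names, the jump factor, the
tolerance, and the choice of invariant (consecutive PCG residuals r
and preconditioned residuals z = M^{-1} r are orthogonal in exact
arithmetic) are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean norm."""
    return math.sqrt(dot(u, u))


def residual_check_fails(norm_history, new_norm, factor=10.0):
    """Residual-based check (hypothetical threshold rule).

    Flags a suspected error when the new residual norm is more than
    `factor` times the smallest residual norm seen so far. Uses only
    the history of residual norms, no extra linear algebra.
    """
    return bool(norm_history) and new_norm > factor * min(norm_history)


def orthogonality_check_fails(r_new, z_old, tol=1e-8):
    """Algorithm-based check (one assumed invariant).

    In exact arithmetic, PCG keeps r_new . z_old == 0, where
    z_old = M^{-1} r_old. One extra dot product tests this, with a
    relative tolerance to absorb floating-point rounding.
    """
    scale = norm(r_new) * norm(z_old)
    return scale > 0.0 and abs(dot(r_new, z_old)) > tol * scale
```

Because conjugate gradient residual norms need not decrease at every
iteration, a threshold rule like the first one has to trade false
positives against false negatives, whereas the second check tests a
property the algorithm is supposed to maintain.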

To correct errors, we utilize per-data-structure snapshots and
restoration from said snapshots at different frequencies.
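
The snapshot-and-restore idea can be sketched in plain Python as
follows. This is a minimal illustration under stated assumptions, not
the paper's implementation and not the GVR API: it uses ordinary
in-memory copies, snapshots all solver state together at a single
frequency (the paper snapshots per data structure), uses a Jacobi
preconditioner, and detects corruption by comparing the recursively
updated residual against b - Ax. The names pcg_with_snapshots,
snapshot_every, inject_at, and check_tol are hypothetical.

```python
import copy
import math


def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg_with_snapshots(A, b, snapshot_every=2, inject_at=None,
                       tol=1e-10, check_tol=1e-8, max_iter=100):
    """Jacobi-preconditioned CG with snapshot-based rollback.

    inject_at: iteration at which x[0] is corrupted once, simulating a
    soft error (hypothetical fault injector).
    Returns (x, iterations_run, number_of_restores).
    """
    n = len(b)
    m_inv = [1.0 / A[i][i] for i in range(n)]  # Jacobi preconditioner
    x = [0.0] * n
    r = list(b)
    z = [mi * ri for mi, ri in zip(m_inv, r)]
    p = list(z)
    rz = dot(r, z)
    b_norm = math.sqrt(dot(b, b)) or 1.0
    snapshot = copy.deepcopy((x, r, p, rz))  # known-good initial state
    restores = 0
    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if inject_at == k:
            x[0] += 1e3  # simulated soft error in memory
        converged = math.sqrt(dot(r, r)) <= tol * b_norm
        if converged or k % snapshot_every == 0:
            # Verify before trusting the state: r should equal b - A x.
            true_r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
            drift = math.sqrt(sum((t - ri) ** 2
                                  for t, ri in zip(true_r, r)))
            if drift > check_tol * b_norm:
                x, r, p, rz = copy.deepcopy(snapshot)  # roll back
                restores += 1
                continue
        if converged:
            return x, k, restores
        z = [mi * ri for mi, ri in zip(m_inv, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
        if k % snapshot_every == 0:
            snapshot = copy.deepcopy((x, r, p, rz))  # verified state
    return x, max_iter, restores


# Usage: a 2x2 symmetric positive definite system, one injected error.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_clean, iters_clean, restores_clean = pcg_with_snapshots(A, b)
x_fault, iters_fault, restores_fault = pcg_with_snapshots(A, b, inject_at=1)
```

In this sketch the verification runs only at snapshot points and at
convergence, so its extra matrix-vector product is amortized over
snapshot_every iterations; a smaller interval shortens the rollback
distance at the cost of more frequent checks and copies.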

We run our PCG solver with various configurations of injected soft
errors, detectors, and correctors to solve problems based on 16
real-world matrices from the University of Florida Sparse Matrix
Collection. For each of these configurations, we evaluate the number
of iterations required for the solver to converge, whether EE has
occurred, and runtime.

These studies show: 1) Though inexpensive, residual-based detection
performs poorly. To achieve acceptably low false negative rates, much
higher false positive rates (30x the number of mitigated false
negatives) are required. 2) Though more expensive, algorithm-based
detection performs better overall, achieving much lower false negative
rates at one seventh the false positive rate. Even this "expensive"
error detection is inexpensive compared to a single iteration, and
therefore is viable for linear solvers, particularly in high
fault-rate systems.

We conclude that, given an appropriate resilience scheme, it is likely
that PCG is still usable even in exascale computing environments.
Further, GVR is an adequate tool to add flexible resilience to PCG
with relative ease.

CURRENT MASTER'S DRAFT IS: VERSION 1

Zachary's advisor is Prof. Andrew Chien

Log in to the Computer Science Department website for details:
 https://www.cs.uchicago.edu/phd/ms_announcements#zar1

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Margaret P. Jaffey            margaret at cs.uchicago.edu
Department of Computer Science
Student Support Rep (Ry 156)               (773) 702-6011
The University of Chicago      http://www.cs.uchicago.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

