[Colloquium] Reminder: Rubenstein/MS Presentation/Feb 13, 2014

Margaret Jaffey margaret at cs.uchicago.edu
Wed Feb 12 09:45:58 CST 2014


This is a reminder about Zachary's MS Presentation tomorrow.

------------------------------------------------------------------------------
Date:  Thursday, February 13, 2014

Time:  1:30 PM

Place:  Ryerson 276

M.S. Candidate:  Zachary Rubenstein

M.S. Paper Title: Error Checking and Snapshot-Based Recovery in a
Preconditioned Conjugate Gradient Solver

Abstract:
With some probability per unit time, each component in a computer
system can experience an intermittent hardware error, or soft error,
due to a number of factors like radiation or circuit noise. Soft
errors can cause incorrect behavior. Incorrect behavior, in turn,
leads to sporadic, incorrect results in computation and incorrect
values in memory.

Current high-performance systems experience errors that go uncorrected
at a rate of roughly one error per day. As high-performance systems
grow from petascale computation to exascale computation, the number of
components will increase, and each of these components will be less
reliable. Experts project that the rate of soft errors may increase
dramatically, to perhaps one uncorrectable hardware error per hour, or
even one every few minutes. In order to generate correct results, and,
in many cases, to generate results at all, these future systems, and
perhaps their applications, will have to correct the erroneous values
that these errors produce.

Current approaches rely on checkpoint/restart (CR) routines. CR
periodically preserves values that are hard to recalculate, and, in
the event of an error, restarts computation from the preserved values
to limit rework. CR depends on two key assumptions. First, the time
required to checkpoint system state is small compared to the time
required to run a computation. Second, all errors will be detected
promptly. As system
size and fault rate increase, many experts believe neither assumption
will hold true.
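
As a minimal sketch of the checkpoint/restart idea described above (not
code from the thesis), the following Python loop saves state every
`interval` steps and rolls back to the last saved state when an error is
detected. The names `run_with_checkpoints`, `step`, and `fault_at` are
hypothetical, and the sketch assumes every error is detected in the step
where it occurs, which is exactly the second assumption above.

```python
import copy

def run_with_checkpoints(step, state, n_steps, interval, fault_at=()):
    """Apply `step` n_steps times, checkpointing every `interval` steps.

    `fault_at` is a hypothetical set of step indices at which an error is
    detected (each index fires once). On detection, the run rolls back to
    the most recent checkpoint. Returns (final_state, steps_executed).
    """
    pending = set(fault_at)
    saved_i, saved_state = 0, copy.deepcopy(state)   # initial checkpoint
    i = executed = 0
    while i < n_steps:
        state = step(state)
        executed += 1
        if i in pending:
            # Error detected: discard this step and restore the checkpoint.
            pending.discard(i)
            i, state = saved_i, copy.deepcopy(saved_state)
            continue
        i += 1
        if i % interval == 0:
            saved_i, saved_state = i, copy.deepcopy(state)
    return state, executed

final, executed = run_with_checkpoints(lambda s: s + 1, 0,
                                       n_steps=10, interval=4, fault_at={5})
print(final, executed)   # 10 12
```

With ten steps, a checkpoint every four steps, and one detected error at
step index 5, the run executes 12 steps instead of 10. The rework stays
bounded by the checkpoint interval only because the error is caught
immediately; an error detected later could already be stored in a
checkpoint.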

Due to the shortcomings of current approaches, we believe it will be
necessary to find more specialized approaches to soft errors. We
assume that it will be difficult to find a single, universal approach
because soft errors are difficult to detect and because there is no
single recovery technique applicable to every application.

In this thesis, we take a preconditioned conjugate gradient (PCG)
solver as an exemplar for a wide range of scientific applications. We
seek an approach to make a PCG solver that produces correct results
quickly in an environment that is prone to errors.

In order to find such a fault tolerance technique, we apply the Global
View Resilience (GVR) library to evaluate approaches for tolerating
soft errors. Our exploration evaluates each approach in terms of
runtime and the likelihood of escaped errors (errors that are
undetected and unmasked). We also evaluate the utility of GVR in
augmenting our solver with algorithms that tolerate soft errors.

To detect errors, we utilize two primary classes of detectors.
Residual-based detectors attempt to deduce errors based exclusively on
the historical distance between PCG's current approximation of the
correct answer and the actual correct answer. Algorithm-based
detectors perform extra linear algebra operations to verify certain
mathematical invariants particular to PCG. To correct errors, we take
snapshots of each data structure, at a frequency chosen per data
structure, and restore from those snapshots.
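
To make the detect-and-restore idea concrete, here is a small
pure-Python sketch, not the thesis implementation and not the GVR API.
It assumes a Jacobi (diagonal) preconditioner, uses one invariant as the
check (the recurrence residual r should equal b - A x, at the cost of an
extra matrix-vector product), and keeps a single in-memory snapshot of
all solver state rather than per-data-structure snapshots. The abstract
does not say which invariants the thesis verifies, so this choice, and
the names `pcg`, `inject_at`, and `snap_every`, are assumptions made for
illustration.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcg(A, b, tol=1e-10, max_iter=200, snap_every=2, inject_at=None):
    """Jacobi-preconditioned CG with an invariant check and rollback.

    inject_at: hypothetical iteration at which x[0] is corrupted once.
    Returns (x, iterations_executed, errors_detected).
    """
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]   # Jacobi preconditioner
    x = [0.0] * n
    r = list(b)                                    # r = b - A x with x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]
    p = list(z)
    rz = dot(r, z)
    b_norm = math.sqrt(dot(b, b))
    snapshot = (list(x), list(r), list(p), rz)
    detected = 0
    for it in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if it == inject_at:
            x[0] += 1.0e3                          # simulated soft error
        # Invariant: the recurrence residual r should match b - A x.
        true_r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
        drift = math.sqrt(sum((t - ri) ** 2 for t, ri in zip(true_r, r)))
        if drift > 1e-8 * b_norm:
            detected += 1                          # restore the snapshot
            x, r, p = list(snapshot[0]), list(snapshot[1]), list(snapshot[2])
            rz = snapshot[3]
            continue
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, it, detected
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
        if it % snap_every == 0:
            snapshot = (list(x), list(r), list(p), rz)
    return x, max_iter, detected

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x, iterations, detected = pcg(A, b, inject_at=2)
print(detected)   # 1
```

In this sketch the corrupted entry of x never feeds back into the
recurrence for r, so a detector that only watches the history of
residual norms would not notice it; the invariant check does, and the
solver rolls back and still converges.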

Our experiments study a PCG solver with various configurations of
injected soft errors, detectors, and correctors to solve problems
based on 16 real-world matrices from the University of Florida Sparse
Matrix Collection. For each configuration, we evaluate the number of
iterations required for the solver to converge, whether an escaped
error has occurred, and runtime.

These studies show: 1) Though inexpensive, residual-based detection
performs poorly. Achieving acceptably low false negative rates
requires high false positive rates (about 30 times the number of
false negatives mitigated). 2) Though more expensive, algorithm-based
detection performs
better overall, achieving much lower false negative rates at one
seventh the false positive rate. Even this relatively expensive error
detection is inexpensive compared to a single solver iteration, and
therefore is viable for linear solvers---particularly in high
error-rate systems.

We used GVR to add multiversioning to our solver. Adding GVR required
changing less than one percent of the lines of code. GVR proved adequate
to the task of adding reliability to a solver without necessitating
extensive extra work.

We conclude that, given an appropriate resilience scheme, it is likely
that PCG is still usable even in faulty computing environments.
Resilience can be achieved at a small overhead, especially using more
expensive, accurate error detection. Further, GVR's multiversioning
and multistream facilities can be used to add flexible resilience to a
PCG solver. Our results suggest promising possibilities for exascale
environments.

Zachary's advisor is Prof. Andrew Chien

Log in to the Computer Science Department website for details:
 https://www.cs.uchicago.edu/phd/ms_announcements#zar1

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Margaret P. Jaffey            margaret at cs.uchicago.edu
Department of Computer Science
Student Support Rep (Ry 156)               (773) 702-6011
The University of Chicago      http://www.cs.uchicago.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

