[Colloquium] Reminder: Fang/MS Presentation/May 11, 2015

Margaret Jaffey margaret at cs.uchicago.edu
Fri May 8 09:18:16 CDT 2015


This is a reminder about Aiman's MS Presentation on Monday.

------------------------------------------------------------------------------
Date:  Monday, May 11, 2015

Time:  9:00 AM

Place:  Ryerson 277

M.S. Candidate:  Aiman Fang

M.S. Paper Title: How Much SSD Is Useful For Resilience In
Supercomputers

Abstract:
Future extreme-scale (exascale) systems are predicted to have high
error rates. Jobs running on such systems will encounter frequent
failures. Checkpoint/restart based on non-volatile memories in the
form of SSDs (Solid State Disks) used as burst buffers is a
promising approach to meeting resiliency needs. However, because of
SSDs' high cost and limited lifetime, understanding their effective
use and appropriate provisioning is critical.

We explore two problems. First, for a set of jobs, how should limited
SSD lifetime be allocated to increase their efficiency in the face of
failures (Allocation)? Second, given a supercomputer system with a
particular error rate, how much SSD lifetime is worth buying to
increase resilience and thereby system efficiency (Provisioning)?

We derive a model that captures the characteristics of jobs and
systems, and use it to formulate the Allocation problem. We use this
model to study the allocation and provisioning questions under a
variety of mission-oriented policy scenarios, including job size-count,
equal job efficiency, and maximum system efficiency. First, we apply
the model to realistic workloads to understand the impact of job
characteristics and mix on allocation. Second, we explore properties
and performance of three allocation policies on various workloads.
Third, we explore appropriate SSD provisioning to achieve acceptable
system efficiency.

First, our results show that the SSD lifetime constraint changes the
checkpoint interval, and thereby the achievable job and system
efficiency. Second, system efficiency can be increased remarkably
by taking a global perspective (workload mix) on SSD lifetime
allocation. Finally, further results suggest that underprovisioning
SSD lifetime (only 10-20% of what would be required without a resource
constraint) is sufficient to produce 90% system efficiency at failure
rates three times those of current systems.

Aiman's advisor is Prof. Andrew Chien.

Log in to the Computer Science Department website for details:
 https://www.cs.uchicago.edu/phd/ms_announcements#aimanf

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Margaret P. Jaffey            margaret at cs.uchicago.edu
Department of Computer Science
Student Support Rep (Ry 156)               (773) 702-6011
The University of Chicago      http://www.cs.uchicago.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

