[Colloquium] Reminder: Fang/MS Presentation/May 11, 2015
Margaret Jaffey
margaret at cs.uchicago.edu
Fri May 8 09:18:16 CDT 2015
This is a reminder about Aiman's MS Presentation on Monday.
------------------------------------------------------------------------------
Date: Monday, May 11, 2015
Time: 9:00 AM
Place: Ryerson 277
M.S. Candidate: Aiman Fang
M.S. Paper Title: How Much SSD Is Useful For Resilience In
Supercomputers
Abstract:
Future extreme-scale (exascale) systems are predicted to have high
error rates. Jobs running on such systems will encounter frequent
failures. Checkpoint/restart based on non-volatile memories in the
form of SSDs (Solid State Disks) used as burst buffers is a
promising approach to meeting resiliency needs. However, because of SSDs'
high cost and limited lifetime, understanding their effective use and
appropriate provisioning is critical.
We explore two problems. First, for a set of jobs, how should
limited SSD lifetime be allocated to increase their efficiency in the
face of failures (Allocation)? Second, given a supercomputer system
with a particular error rate, how much SSD lifetime is worth buying to
increase resilience and thereby system efficiency (Provisioning)?
We derive a model that captures the characteristics of jobs and
systems, and use it to formulate the Allocation problem. We use this
model to study the allocation and provisioning questions under a
variety of mission-oriented policy scenarios including job size-count,
equal job efficiency, and maximum system efficiency. First, we apply
the model to realistic workloads to understand the impact of job
characteristics and mix on allocation. Second, we explore properties
and performance of three allocation policies on various workloads.
Third, we explore appropriate SSD provisioning to achieve acceptable
system efficiency.
First, our results show that the SSD lifetime constraint changes the
checkpoint interval, and thereby the achievable job and system
efficiency. Second, the system efficiency can be increased remarkably
by considering a global perspective (workload mix) for SSD lifetime
allocation. Finally, further results suggest that underprovisioning
SSD lifetime (only 10-20% of the unconstrained requirement) is
sufficient to produce 90% system efficiency at failure rates three
times those of current systems.
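
As a rough illustration of the first result (this is not the model
from the paper), the sketch below shows one way a limited SSD write
budget could lengthen a job's checkpoint interval and lower its
efficiency. It assumes Young's first-order approximation for the
unconstrained interval and a simple first-order waste estimate. The
function names, the write-budget constraint, and the example numbers
are all hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def lifetime_floor(job_runtime_s: float, checkpoint_bytes: float,
                   write_budget_bytes: float) -> float:
    """Shortest interval that keeps total checkpoint writes within the budget.

    Assumption: the job writes one checkpoint of `checkpoint_bytes` per
    interval, so (runtime / interval) * checkpoint_bytes <= write budget.
    """
    return job_runtime_s * checkpoint_bytes / write_budget_bytes


def constrained_interval(checkpoint_cost_s: float, mtbf_s: float,
                         job_runtime_s: float, checkpoint_bytes: float,
                         write_budget_bytes: float) -> float:
    """Young's interval, stretched if the SSD write budget requires it."""
    return max(young_interval(checkpoint_cost_s, mtbf_s),
               lifetime_floor(job_runtime_s, checkpoint_bytes,
                              write_budget_bytes))


def efficiency(interval_s: float, checkpoint_cost_s: float,
               mtbf_s: float) -> float:
    """First-order estimate: 1 - checkpoint overhead - expected rework.

    Restart cost is ignored; the result is clamped at zero.
    """
    waste = checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    # Hypothetical job: 60 s checkpoints, 1 h MTBF, 24 h runtime,
    # 1 TB per checkpoint, 50 TB of SSD writes allotted to the job.
    cost, mtbf, runtime = 60.0, 3600.0, 86400.0
    free = young_interval(cost, mtbf)
    limited = constrained_interval(cost, mtbf, runtime, 1e12, 5e13)
    print(f"unconstrained: {free:.0f} s, eff {efficiency(free, cost, mtbf):.3f}")
    print(f"constrained:   {limited:.0f} s, eff {efficiency(limited, cost, mtbf):.3f}")
```

Under these assumptions, a tighter write budget pushes the interval
above Young's value, which trades fewer SSD writes for more lost work
per failure.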
Aiman's advisor is Prof. Andrew Chien.
Log in to the Computer Science Department website for details:
https://www.cs.uchicago.edu/phd/ms_announcements#aimanf
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Margaret P. Jaffey margaret at cs.uchicago.edu
Department of Computer Science
Student Support Rep (Ry 156) (773) 702-6011
The University of Chicago http://www.cs.uchicago.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=