[Colloquium] Fang/MS Presentation/May 11, 2015

Margaret Jaffey margaret at cs.uchicago.edu
Mon Apr 27 13:21:18 CDT 2015


This is an announcement of Aiman Fang's MS Presentation.

------------------------------------------------------------------------------
Date:  Monday, May 11, 2015

Time:  9:00 AM

Place:  Ryerson 277

M.S. Candidate:  Aiman Fang

M.S. Paper Title: How Much SSD Is Useful For Resilience In
Supercomputers

Abstract:
Future extreme-scale systems are projected to have higher error rates,
producing MTBFs (Mean Time Between Failures) that could be less than
an hour. Large-scale jobs running on exascale systems will encounter
frequent failures. Checkpoint/restart based on non-volatile memories
in the form of SSDs (Solid State Disks) and used as burst buffers is
a promising approach to meet resiliency needs. However, because of
SSDs' high cost and limited lifetime, understanding their effective
use and appropriate provisioning is critical.

We explore two problems in this thesis. First, for a set of jobs in a
workload, how should we allocate SSD lifetime to increase their
efficiency in the face of failures? (Allocation). Second, given a
supercomputer system with a particular error rate, how much SSD
lifetime is worth buying to increase resilience and thereby system
efficiency? (Provisioning).

To answer these questions, we develop a model that captures the
characteristics of jobs and systems. It can optimize global properties
such as system efficiency, or local properties such as job efficiency
given a resource constraint. We use this model to explore the allocation
and provisioning questions under a variety of mission-oriented policy
scenarios including job size-count, equal job efficiency and maximum
system efficiency.

To apply the model, we describe the system of interest and workload
properties. We first apply a system-efficiency-based allocation policy
to realistic workloads to understand the impact of job characteristics
and mix on allocation. Second, we explore properties and performance
of three different allocation policies on various workloads. Third, we
explore appropriate SSD provisioning to achieve acceptable system
efficiency.

First, our results show that system efficiency can be increased by as
much as 14% by considering a global perspective (workload mix, job
size) for SSD lifetime allocation. Second, with job-size-based and
system-efficiency-based allocation, large jobs suffer as much as 40%
lower job efficiency; job-efficiency-based allocation must increase their
allocation by 50% to eliminate this disparity. Finally, further
results suggest that under-provisioning SSD lifetime (only 10-20% of
the “optimum” as defined by per-job requirements without resource
constraint) is sufficient to produce 90% system efficiency at failure
rates three times that of current systems.

Our results give insight into appropriate policies for SSD usage for
resilience in future supercomputers that include burst buffers, and
into a cost-effective approach to provisioning burst buffers on such
systems.

Aiman's advisor is Prof. Andrew Chien.

Log in to the Computer Science Department website for details:
 https://www.cs.uchicago.edu/phd/ms_announcements#aimanf

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Margaret P. Jaffey            margaret at cs.uchicago.edu
Department of Computer Science
Student Support Rep (Ry 156)               (773) 702-6011
The University of Chicago      http://www.cs.uchicago.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

