[Colloquium] Reminder: Hao/MS Presentation/Sep 12, 2016

Margaret Jaffey via Colloquium colloquium at mailman.cs.uchicago.edu
Fri Sep 9 09:18:56 CDT 2016


This is a reminder about Mingzhe's MS Presentation on Monday.

------------------------------------------------------------------------------
Date:  Monday, September 12, 2016

Time:  11:00 AM

Place:  Ryerson 276

M.S. Candidate:  Mingzhe Hao

M.S. Paper Title: The Tail at Store: A Revelation of Storage Tail
Latencies from Millions of Hours of Disk and SSD Deployments

Abstract:
RAID groups are now widely deployed to provide storage systems with
high availability and reliability at acceptable overhead. However,
this design is also vulnerable to performance tails in the
underlying drives. As each operation needs to access multiple drives,
operations are held back by the slowest one. To evaluate how far this
vulnerability has turned into a practical problem, we study storage
performance in over 450,000 disks and 4,000 SSDs over 87 days for an
overall total of 857 million (disk) and 7 million (SSD) drive hours.
We find that storage performance instability is not uncommon: 0.2% of
the time, a disk is more than 2x slower than its peer drives in the
same RAID group (and 0.6% for SSD). As a consequence, disk and
SSD-based RAIDs experience at least one slow drive (i.e., storage
tail) 1.5% and 2.2% of the time, respectively. Furthermore, we find
that tails on the same drive can last for hours continuously, which
strongly suggests that timely action is needed. However, simply
replacing the slow drives is not a good solution, as the tail
occurrences are spread across a large fraction of drives (26% of disks
and 29% of SSDs), and only a few drives have a large number of
slowdown occurrences within their log time. To understand the root
causes, we correlate slowdowns with other
metrics (workload I/O rate and size, drive event, age, and model).
The results show that slowdowns are rarely caused by I/O rate or size
imbalance, and only a few slowdowns happen around the occurrence of
drive errors captured by the monitoring system. Regarding drive
properties, we see a clear trend that older drives and certain drive
models are more likely to suffer from slowdowns. Overall, we find that
the primary cause of slowdowns is the internal characteristics and
idiosyncrasies of modern disks and SSDs. We observe that storage tails can
adversely impact RAID performance. For example, a slow SSD gives
the whole RAID group a 23% higher chance of suffering more than 2x
performance degradation. To the best of our knowledge, this work is
the most extensive documentation of storage performance instability in
the field. Our observations reveal that storage systems can take
advantage of tail-tolerant mechanisms at a low level. By looking into
users’ actions against slowdowns, we find that administrators at
customer sites sometimes unplug slow drives for offline diagnosis and
repair, then plug them back into the RAID. However, these offline
actions turn out to be ineffective, as replugged drives tend to show
slowdowns again in the future. To mitigate the negative impact of tails in RAID, we
introduce TTRAID (tail-tolerant RAID). TTRAID can free regular
full-stripe operations from the first and second tails. For reads,
TTRAID additionally accesses the parity, proactively or passively,
to reconstruct data from the slowest drives. For writes, TTRAID allows
an earlier return once minimal work is done. The evaluation shows that
TTRAID successfully shields the longest tails and provides RAID with
better performance. It also indicates further opportunities for
applying tail-tolerant mechanisms at a low level, such as a
tail-tolerant operating system, which is part of our future work.
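
As a rough illustration of two ideas in the abstract (a sketch, not
code from the paper): flagging a drive that is more than 2x slower
than its peers in a RAID group, and rebuilding a slow drive's chunk
from parity. The function names are hypothetical, the use of the
group median as the peer baseline is an assumption, and the
single-parity XOR scheme is a simplification; the abstract does not
specify TTRAID's actual parity layout or policy.

```python
from statistics import median

SLOWDOWN_THRESHOLD = 2.0  # "more than 2x slower" in the abstract


def slowdown_ratios(latencies):
    """Return each drive's latency divided by the group median.

    Using the median as the peer baseline is an assumption of this sketch.
    """
    baseline = median(latencies)
    if baseline <= 0:
        raise ValueError("latencies must be positive")
    return [lat / baseline for lat in latencies]


def slow_drives(latencies, threshold=SLOWDOWN_THRESHOLD):
    """Return indices of drives whose slowdown ratio exceeds the threshold."""
    ratios = slowdown_ratios(latencies)
    return [i for i, ratio in enumerate(ratios) if ratio > threshold]


def xor_parity(chunks):
    """Compute a single XOR parity chunk over equally sized data chunks."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for j, byte in enumerate(chunk):
            parity[j] ^= byte
    return bytes(parity)


def reconstruct_from_parity(chunks, parity, slow_index):
    """Rebuild chunks[slow_index] from the parity and the other chunks.

    The chunk at slow_index is never read, so the read does not have to
    wait for the slow drive.
    """
    rebuilt = bytearray(parity)
    for i, chunk in enumerate(chunks):
        if i == slow_index:
            continue  # skip the slow drive entirely
        for j, byte in enumerate(chunk):
            rebuilt[j] ^= byte
    return bytes(rebuilt)
```

For example, with per-drive latencies of 10, 11, 9, and 30 ms, only
the last drive exceeds twice the group median, so a read could fetch
the parity and the three faster chunks instead of waiting for it.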

Mingzhe's advisor is Prof. Haryadi Gunawi.

Log in to the Computer Science Department website for details:
 https://www.cs.uchicago.edu/phd/ms_announcements#hmz20000

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Margaret P. Jaffey            margaret at cs.uchicago.edu
Department of Computer Science
Student Support Rep (Ry 156)               (773) 702-6011
The University of Chicago      http://www.cs.uchicago.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

