[Staff] [Faculty] [Colloquium] George Candea today 2/28/05
Nita Yack
nitayack at midway.uchicago.edu
Mon Feb 28 08:17:33 CST 2005
COMPUTER SCIENCE
The University of Chicago
TALK
Monday, February 28, 2005
2:30 p.m.
Ryerson 251
George Candea,
Stanford University
http://www.stanford.edu/~candea
Title: A Design, Mechanism and Policy for Highly Available Software
Systems
Abstract: Software failures are a dominant cause of outages in
large-scale software-intensive systems; the exact root cause of these
failures is often unknown and the only cure is to reboot.
Unfortunately, rebooting can be expensive, leading to nontrivial
service disruption or downtime even when clusters and failover are
employed.
In this talk I will describe the "crash-only design," a way to build
reboot-friendly systems that engenders a new way of thinking about
recovery. I will also present the "microreboot," a technique for
surgically recovering faulty application components, without disturbing
the rest of the application. I will argue that recovery-oriented
techniques complement bug-reduction efforts and provide significant
improvements in software dependability. We applied this design and
technique to a satellite ground station and an Internet auction system;
without fixing any of the systems' bugs, microreboots recovered most of
the same failures as full reboots, but did so an order of magnitude
faster and with an order of magnitude savings in lost work.
Simple, cheap recovery enables new policies for achieving high
availability. First, the cost of superfluous recovery is low, so we
can microreboot at the slightest hint of failure, without having
certainty that a failure occurred -- the added cost of increased
aggressiveness is outweighed by the benefits of early recovery. Cheap
recovery allowed us to use failure detection based on statistical
learning, which yields fewer false negatives at the cost of more
false positives; by closing the monitor-diagnose-recover loop, we
built an autonomously recovering system. Second, we can
prophylactically use microreboots to rejuvenate a software system by
parts, without ever bringing it down. Finally, we can mask recovery
from end users through transparent call-level retries; since
microreboots take less than half a second, such masking turns failures
into human-tolerable latency blips.
Host: Anne Rogers
*The talk will be followed by refreshments in Ryerson 255*
Persons who need assistance should call 773-702-6614
Nita Yack
Departmental Administrator
Computer Science Department
1100 E. 58th Street - Room 151
Chicago, IL 60637
(773) 702-6019