[Staff] [Faculty] [Colloquium] George Candea today 2/28/05

Nita Yack nitayack at midway.uchicago.edu
Mon Feb 28 08:17:33 CST 2005


COMPUTER SCIENCE
The University of Chicago
TALK



Monday, February 28, 2005
2:30 p.m.
Ryerson 251

George Candea,
Stanford University
http://www.stanford.edu/~candea

Title: A Design, Mechanism and Policy for Highly Available Software 
Systems

Abstract: Software failures are a dominant cause of outages in 
large-scale software-intensive systems; the exact root cause of these 
failures is often unknown and the only cure is to reboot.  
Unfortunately, rebooting can be expensive, leading to nontrivial 
service disruption or downtime even when clusters and failover are 
employed.

In this talk I will describe the "crash-only design," a way to build 
reboot-friendly systems that engenders a new way of thinking about 
recovery.  I will also present the "microreboot," a technique for 
surgically recovering faulty application components, without disturbing 
the rest of the application.  I will argue that recovery-oriented 
techniques complement bug-reduction efforts and provide significant 
improvements in software dependability.  We applied this design and 
technique to a satellite ground station and an Internet auction system; 
without fixing any of the systems' bugs, microreboots recovered most of 
the same failures as full reboots, but did so an order of magnitude 
faster and with an order of magnitude savings in lost work.

Simple, cheap recovery enables new policies for achieving high 
availability.  First, the cost of superfluous recovery is low, so we 
can microreboot at the slightest hint of failure, without having 
certainty that a failure occurred -- the added cost of increased 
aggressiveness is outweighed by the benefits of early recovery.  Cheap 
recovery allowed us to use failure detection based on statistical 
learning, which yields fewer false negatives at the cost of more 
false positives; by closing the monitor-diagnose-recover loop, we
built an autonomously recovering system.  Second, we can 
prophylactically use microreboots to rejuvenate a software system by 
parts, without ever bringing it down.  Finally, we can mask recovery 
from end users through transparent call-level retries; since
microreboots take less than half a second, such masking turns failures 
into human-tolerable latency blips.


Host:  Anne Rogers
*The talk will be followed by refreshments in Ryerson 255*
Persons who need assistance should call 773-702-6614

Nita Yack
Departmental Administrator
Computer Science Department
1100 E. 58th Street - Room 151
Chicago, IL 60637
(773) 702-6019