[Colloquium] [Staff] Computer Science Seminars

Donna Brooms donna at cs.uchicago.edu
Tue Apr 30 05:36:53 CDT 2013


COMPUTER SCIENCE SEMINAR

Tuesday, April 30, 2013
9:30 a.m.
Ryerson 251
 
Dr. Leonardo A. Bautista-Gomez
Argonne National Laboratory
http://matsu-www.is.titech.ac.jp
 
Title:  “Fast checkpointing for extreme scale systems”
 
Abstract: In high-performance computing, scientific applications need to make progress despite frequent failures. Long-running executions are therefore periodically checkpointed to stable storage. Today, the overhead imposed by checkpointing to a parallel file system is about 25% of execution time. In future exascale supercomputers, checkpointing will become prohibitively time-consuming. We developed a fault tolerance interface that exploits the features of large-scale hybrid systems by implementing low-overhead, high-frequency multi-level checkpointing that combines a Topology-Aware Reed-Solomon encoding algorithm with modern local storage devices, advanced clustering techniques, and Fault Tolerance Dedicated Threads. Finally, we present an exascale study based on our performance model and show that our approach can guarantee low overhead in future extreme-scale systems.
---
Bio:
 
Leonardo A. Bautista-Gomez received a B.Sc. in computer science and an M.Sc. in parallel and distributed systems from the University Paris 6 Pierre & Marie Curie. He earned his Dr. Sci. at the Tokyo Institute of Technology, where he conducted research on fault tolerance for high-performance computing. In 2011, he received the ACM/IEEE George Michael Memorial High Performance Computing Fellowship Honorable Mention for SC11, as well as a Special Certificate of Recognition for achieving a perfect score in the reviews of his SC11 paper. He now conducts research on reliability for extreme-scale supercomputers at Argonne National Laboratory.
 
Host: Andrew Chien
 
 
 

