[Colloquium] Raicu/Dissertation Defense/Feb 12, 2009

Margaret Jaffey margaret at cs.uchicago.edu
Thu Jan 29 10:58:58 CST 2009



       Department of Computer Science/The University of Chicago

                     *** Dissertation Defense ***


Candidate:  Ioan Raicu

Date:  Thursday, February 12, 2009

Time:  3:00 PM

Place:  Ryerson 251

Title: Many-Task Computing: Bridging the Gap between High Throughput
Computing and High Performance Computing

Abstract:
Many-task computing aims to bridge the gap between two computing
paradigms, high throughput computing and high performance computing.
Many-task computing is reminiscent of high throughput computing, but
differs in its emphasis on using many computing resources over short
periods of time to accomplish many computational tasks, where the
primary metrics are measured in seconds (e.g., floating point
operations per second, tasks per second, input or output rates per
second), as opposed to operations per day or month (e.g., jobs per
month). Many-task computing denotes high-performance computations
comprising multiple distinct activities, coupled via file system
operations. Tasks may be small or large, uniprocessor or
multiprocessor, compute-intensive or data-intensive. The set of tasks
may be static or dynamic, homogeneous or heterogeneous, loosely
coupled or tightly coupled. The aggregate number of tasks, quantity of
computing, and volumes of data may be extremely large. Many-task
computing includes loosely coupled applications that are generally
communication-intensive but not naturally expressed using the message
passing interface (MPI) commonly found in high performance computing,
drawing attention to the many computations that are heterogeneous but
not “happily” parallel.

This dissertation explores fundamental issues in defining the
many-task computing paradigm, as well as theoretical and practical
issues in supporting both compute- and data-intensive many-task
computing on large-scale systems. We have defined an abstract model
for data diffusion, defined scheduling policies with heuristics to
optimize real-world performance, and developed a competitive online
cache eviction policy. We also designed and
implemented the necessary middleware – Falkon – to enable the support
of many-task computing on clusters, grids, and supercomputers. Falkon,
a Fast and Light-weight tasK executiON framework, addresses
shortcomings in traditional resource management systems for high
throughput and high performance computing, which are not suitable for
or efficient at supporting many-task computing applications. Falkon
was designed to enable the rapid and efficient execution of many tasks
on large-scale systems (i.e., through multi-level scheduling and
streamlined distributed task dispatching) and to integrate novel data
management capabilities (i.e., data diffusion, which uses data caching
and data-aware scheduling to exploit data locality) to extend the
scalability of data-intensive applications well beyond that of
traditional shared or parallel file systems.

As the size of scientific data sets
and the resources required for their analysis increase, data locality
becomes crucial to the efficient use of large scale distributed
systems for data-intensive many-task computing. We propose a “data
diffusion” approach that acquires compute and storage resources
dynamically, replicates data in response to demand, and schedules
computations close to data. As demand increases, more resources are
acquired, allowing faster response to subsequent requests that refer
to the same data, and as demand drops, resources are released. This
approach provides the benefits of dedicated hardware without the
associated high costs, depending on workload and resource
characteristics.

Micro-benchmarks have shown Falkon to achieve throughput of over 15K
tasks/sec, scale to millions of queued tasks, and execute billions of
tasks per day. Data diffusion has also been shown to improve
application scalability and performance, with the ability to achieve
hundreds of Gb/s I/O rates on modest-sized clusters. Falkon has shown
orders-of-magnitude improvements in performance and scalability across
many diverse workloads (e.g., heterogeneous tasks from 100 ms to hours
long, compute-intensive, data-intensive, varying arrival rates) and
applications (e.g., astronomy, medicine, chemistry, molecular
dynamics, economic modeling, and data analytics) at scales of billions
of tasks on hundreds of thousands of processors across clusters, grids
(e.g., TeraGrid), and supercomputers (e.g., IBM Blue Gene/P and Sun
Constellation).
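To give a flavor of the data-aware scheduling that data diffusion
describes, the following is a minimal illustrative sketch (not the
dissertation's actual model or the Falkon implementation): a scheduler
prefers to dispatch each task to a worker whose cache already holds
the task's input file, falls back to replicating the data onto another
worker on a miss, and uses a simple LRU eviction policy as a
stand-in for the competitive online cache eviction policy developed in
the dissertation. All class and method names here are hypothetical.

```python
from collections import OrderedDict

class Worker:
    """A compute node with a small LRU-managed data cache (illustrative only)."""
    def __init__(self, name, cache_slots):
        self.name = name
        self.cache_slots = cache_slots
        self.cache = OrderedDict()  # file -> None, ordered by recency

    def touch(self, f):
        # Record a cache hit: move the file to most-recently-used position.
        self.cache.move_to_end(f)

    def add(self, f):
        # Replicate the file into the cache, evicting the LRU entry if full.
        if f in self.cache:
            self.touch(f)
            return
        if len(self.cache) >= self.cache_slots:
            self.cache.popitem(last=False)  # evict least-recently-used file
        self.cache[f] = None

class DataAwareScheduler:
    """Dispatch each task to a worker already caching its input, if any."""
    def __init__(self, workers):
        self.workers = workers

    def dispatch(self, input_file):
        # Prefer a worker whose cache holds the input (exploit data locality).
        for w in self.workers:
            if input_file in w.cache:
                w.touch(input_file)
                return w, "cache-hit"
        # Otherwise pick the worker with the emptiest cache and replicate.
        w = min(self.workers, key=lambda w: len(w.cache))
        w.add(input_file)
        return w, "cache-miss"

workers = [Worker("w1", 2), Worker("w2", 2)]
sched = DataAwareScheduler(workers)
print(sched.dispatch("a.dat")[1])  # cache-miss: first access replicates a.dat
print(sched.dispatch("a.dat")[1])  # cache-hit: task lands where a.dat is cached
```

Repeated requests for the same file are served from cache, so response
time improves as demand grows, mirroring the demand-driven replication
described in the abstract.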

Candidate's Advisor: Prof. Ian Foster

Log in to the Computer Science Department website for details,
including a draft copy of the dissertation:

 https://www.cs.uchicago.edu/phd/phd_announcements#iraicu

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Margaret P. Jaffey            margaret at cs.uchicago.edu
Department of Computer Science
Student Support Rep (Ry 156)               (773) 702-6011
The University of Chicago      http://www.cs.uchicago.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
