[Colloquium] Zhang/Dissertation Defense/Apr 18, 2014

Margaret Jaffey margaret at cs.uchicago.edu
Thu Apr 3 16:39:17 CDT 2014



       Department of Computer Science/The University of Chicago

                     *** Dissertation Defense ***


Candidate:  Zhao Zhang

Date:  Friday, April 18, 2014

Time:  10:00 AM

Place:  Ryerson 277

Title: Enabling Efficient Parallel Scripting on Large-scale Computers

Abstract:
Many-task computing (MTC) applications assemble existing sequential
(or parallel) programs, using POSIX files for intermediate data. The
parallelism of such applications often comes from data parallelism.
MTC applications can be grouped into stages, and dependencies between
tasks in different stages can be in the form of file production and
consumption. A computation stage of an MTC application can contain a
large number of tasks, and so it can generate a large volume of highly
concurrent I/O traffic (both metadata operations and data transfers).
Some MTC applications are iterative, where the computation iterates
over a dataset and exits when some condition(s) are reached. Some MTC
applications are interactive, where the application requires human
action between computation stages.
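
As an illustration of the staged, file-coupled structure described
above, here is a minimal sketch in Python. It is not taken from the
dissertation: the function names (produce, consume, run_workflow), the
file naming, and the use of a thread pool on a single machine are all
assumptions made only for this example.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def produce(workdir: Path, i: int) -> Path:
    """Stage 1 task (hypothetical): write one intermediate POSIX file."""
    out = workdir / f"part_{i}.txt"
    out.write_text(str(i * i))
    return out


def consume(paths: List[Path]) -> int:
    """Stage 2 task (hypothetical): read every intermediate file and combine them."""
    return sum(int(p.read_text()) for p in paths)


def run_workflow(n_tasks: int) -> int:
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        # Stage 1 is data parallel: its tasks do not depend on one another.
        with ThreadPoolExecutor(max_workers=4) as pool:
            parts = list(pool.map(lambda i: produce(workdir, i), range(n_tasks)))
        # Stage 2 depends on stage 1 only through the files it produced.
        return consume(parts)


if __name__ == "__main__":
    print(run_workflow(8))
```

Even in this toy version, every stage 1 task creates a file and every
stage 2 input is opened and read, so the number of file system
operations grows with the number of tasks.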

MTC applications have been seen in many scientific research domains:
astronomy, biochemistry, bioinformatics, psychology, economics,
climate science, physical chemistry, and neuroscience. These
applications cover a wide range of methodologies, such as rational
design, uncertainty quantification, parameter estimation, massive
dynamic graph pruning, Monte-Carlo-based iterative fixing, and inverse
modeling. Scientists have used numerous parallel programming
languages/models (for example, MPI, Hadoop, Swift, Makeflow, and
Pegasus) to execute the MTC applications on large-scale computers.
However, for some of these languages/models, such as MPI and Hadoop,
the mismatch between the languages and the applications results in a
loss of conciseness, as programmers work on in-memory variables while
the applications operate on files and directories. In some cases,
programmers need to implement complex data management logic inside the
program to achieve high efficiency, and the solution is often ad-hoc.
Among these approaches, Swift and Makeflow are functional parallel
scripting languages. They fit MTC applications naturally, but they
lack efficient in-memory data management because they originated in a
distributed environment (grid computing). As a result, the execution
of these parallel scripting programs is usually inefficient, due to
the sheer amount of application I/O and the under-optimized shared
file system.

Thus, no single parallel scripting framework simultaneously provides
a concise programming interface, general efficient execution, and
scalable performance. More specifically, a
concise programming interface could avoid changes to the original
application code and preserve the way the application is invoked.
General efficient execution requires an application-independent
performance improvement scheme. Scalable performance requires that the
framework run on thousands of compute nodes with acceptable
overhead. Providing these three features at the same time is
challenging. For example, one common approach to improving execution
performance is data-aware scheduling, where the scheduling system
needs file usage information (which files are inputs and which are
outputs), but UNIX/Linux command-line parameters are usage-blind.
Caching is another common approach to improving execution performance.
To preserve the original applications' POSIX interface, the parallel
scripting framework should be able to cache data in RAM while keeping
it accessible through a POSIX interface (an in-RAM file system).
However, at
the time this thesis research was conducted, mainstream file systems
usually had one metadata server, which is insufficient for the I/O
traffic of some MTC applications. (Later developments introduced
distributed metadata servers, which balance the metadata over
multiple nodes, but this is sub-optimal for MTC applications.)
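
To make the data-aware scheduling idea above concrete, the following
Python sketch picks the node that already holds the most input bytes
for a task. It is only an illustration under stated assumptions: the
function pick_node and its arguments (a file-to-node map and a
file-size map) are hypothetical and do not describe the scheduler of
any particular framework.

```python
from collections import Counter
from typing import Dict, List


def pick_node(inputs: List[str],
              location: Dict[str, str],
              size: Dict[str, int],
              default: str) -> str:
    """Return the node that already stores the most bytes of a task's inputs.

    inputs   -- input file names of the task (the "file usage information")
    location -- hypothetical map from file name to the node caching it
    size     -- hypothetical map from file name to its size in bytes
    default  -- node to use when no input is cached anywhere
    """
    bytes_on_node: Counter = Counter()
    for name in inputs:
        if name in location:
            bytes_on_node[location[name]] += size.get(name, 0)
    if not bytes_on_node:
        return default
    # Sorting the node names first makes ties resolve deterministically.
    return max(sorted(bytes_on_node), key=bytes_on_node.__getitem__)


if __name__ == "__main__":
    where = {"a.dat": "node1", "b.dat": "node2", "c.dat": "node2"}
    sizes = {"a.dat": 100, "b.dat": 30, "c.dat": 30}
    print(pick_node(["a.dat", "b.dat", "c.dat"], where, sizes, "node0"))
```

The sketch only works because the caller passes the list of input
files explicitly, which is exactly the information a plain command
line does not carry.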

In this dissertation we develop a complete parallel scripting
framework called AMFORA, which has a shared in-RAM file system and a
task execution engine. It implements a multi-read single-write
consistency model, preserves the POSIX interface for original
applications, and provides an interface for collective data movement
and functional data transformation. It is interoperable with many
existing serial scripting languages (e.g., Bash, Python). AMFORA runs
on thousands of compute nodes on an IBM BG/P supercomputer. It also
runs on cloud environments such as Amazon EC2 and Google Compute
Engine. To understand the baseline MTC application performance on
large-scale computers, we define MTC Envelope, which is a file system
benchmark to measure the capacity of a given software/hardware stack
in the context of MTC applications.
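
One possible reading of a multi-read single-write model is that each
file is written once and may then be read any number of times. The
Python sketch below illustrates that reading only. The class
WriteOnceStore is hypothetical, keeps data in a local dictionary, and
is not AMFORA's implementation or interface.

```python
from typing import Dict


class WriteOnceStore:
    """Toy in-memory store: one write per path, unlimited reads (an assumption)."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        # A second write to the same path is rejected.
        if path in self._files:
            raise FileExistsError(f"{path} was already written")
        self._files[path] = bytes(data)

    def read(self, path: str) -> bytes:
        # Reads never change state, so any number of readers see the same bytes.
        return self._files[path]


if __name__ == "__main__":
    store = WriteOnceStore()
    store.write("/stage1/out.txt", b"result")
    print(store.read("/stage1/out.txt"))
    print(store.read("/stage1/out.txt"))
```

Under this reading, a file's content never changes after it is
written, so readers do not need to coordinate with one another.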

The main contributions of this dissertation are:

A system-independent approach to profile and understand the
concurrency of MTC applications' I/O behavior

A benchmark definition that measures the file system's capacity for
MTC applications

A theoretical model to estimate the I/O overhead of MTC applications
on large-scale computers

A distributed file system design, with no centralized component, that
achieves good scalability

A collective file system management toolkit to enable fast data
movement

A functional file system management toolkit to enable fast file
content transformation

A new parallel scripting programming model that extends a scripting
language (e.g., Bash)

A novel file system access interface design that combines both POSIX
and non-POSIX interfaces to ease programming without loss of
efficiency

An automated method for identifying data flow patterns that are
amenable to collective optimizations at runtime

The open source implementation of the entire framework to enable MTC
applications on large-scale computers

Zhao's advisor is Prof. Ian Foster.

Log in to the Computer Science Department website for details,
including a draft copy of the dissertation:

 https://www.cs.uchicago.edu/phd/phd_announcements#zhaozhang

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Margaret P. Jaffey            margaret at cs.uchicago.edu
Department of Computer Science
Student Support Rep (Ry 156)               (773) 702-6011
The University of Chicago      http://www.cs.uchicago.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

