[Colloquium] [defense] Tong/Dissertation Defense/Jul 14, 2020

Margaret Jaffey margaret at cs.uchicago.edu
Tue Jun 30 14:21:31 CDT 2020


This is an announcement about Hao Michael Tong's dissertation defense.

Here is the Zoom link to participate:

https://zoom.us/j/91334775723?pwd=eXY5T3AvQ1NBSit6WGtoKzlOOHhRZz09
Meeting ID: 913 3477 5723 Password: 776228

       Department of Computer Science/The University of Chicago

                     *** Dissertation Defense ***


Candidate:  Michael (Hao) Tong

Date:  Tuesday, July 14, 2020

Time:  2:00 PM

Place:  remotely via Zoom

Title: Improving the Performance of Long Running Scientific Pipelines
in a Bioinformatics Pipeline Platform

Abstract:
The Genomic Data Commons (GDC) is a data platform for managing,
processing, analyzing, and sharing cancer genomics data. The data
processing component of the GDC is called the GDC Pipeline Automation
System (GPAS). GPAS currently runs multiple bioinformatics pipelines
on an on-premises cluster of virtual machines (VMs) and bare metal
machines.

The GPAS has been used in production for over two years and valuable
pipeline statistics are scattered in multiple databases across the
platform. This dissertation presents a platform-wide statistics
collecting service for the GPAS. Based on the synthesized statistics,
several performance issues have been identified and investigated.

The first performance issue examined is that jobs on VMs exhibit
highly varied performance. In particular, there can be a very long
tail, with some VMs taking significantly longer than others to execute
the same jobs. Through an analysis of job statistics and traces, we
find that the root cause is the virtual machine memory management
layer in the VM hypervisor. When the layer is overwhelmed by intense
searches for memory mappings from the virtual machine to the physical
host, it causes the performance of the VM to degrade.

The second performance issue examined concerns job scheduling. Through
an analysis of production statistics, we find that GPAS’ overall work
progress can be delayed by days even if only a small percentage of
jobs fail. The dissertation documents, with evidence, several other
drawbacks of the current simple job scheduling model and proposes a
more sophisticated task-based scheduling model.

Lastly, the dissertation presents a thorough literature review that
informs a vision for the GPAS with further improved pipeline
performance.

Michael (Hao)'s advisor is Prof. Robert Grossman.

Log in to the Computer Science Department website for details,
including a draft copy of the dissertation:

 https://newtraell.cs.uchicago.edu/phd/phd_announcements#michaelht

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Margaret P. Jaffey            margaret at cs.uchicago.edu
Department of Computer Science
Student Support Rep (JCL 350)              (773) 702-6011
The University of Chicago      http://www.cs.uchicago.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

