[Colloquium] Reminder - Arjun Rawal MS Presentation/June 2, 2020

Jessica Garza jdgarza at cs.uchicago.edu
Tue May 26 10:49:53 CDT 2020


This is an announcement of Arjun Rawal's MS Presentation. Arjun is a student in the Bx/MS program.

Here is the Zoom link to participate:
https://uchicago.zoom.us/j/98111735383?pwd=UndQbnpJVFVtTmpDeUxWU3luUm9jdz09

Password: 206975

One tap mobile
+13126266799,,98111735383# US (Chicago)

Dial by your location
        +1 312 626 6799 US (Chicago)

Meeting ID: 981 1173 5383

----------------------------------------------------------------------------------------------------------------------------------------

Date: Tuesday, June 2nd, 2020

Time: 10:30 AM, Central Time

Location: remote via Zoom

M.S. Candidate: Arjun Rawal

M.S. Paper Title: Exploiting Domain-Specific Data Properties to Improve Compression for High Energy Physics Data

Abstract:

Data storage is a fundamental concern for high energy physics; the experiments and data analysis needed to discover new results require petabytes of measurements from particle collisions. Accordingly, data compression has been a central focus of data storage solutions, as it provides an effective way to reduce storage costs and improve analysis performance. Whereas interactive analysis workloads benefit from fast data availability for computation, archival storage benefits from compression that makes data as small as possible. For most high energy physics data, the standard approach to compression is "one size fits all": data is stored for archive with the same compression used for interactive analysis. Because data analysis and long-term storage are fundamentally different use cases, the tradeoffs made to provide performant data analysis result in relatively poor compression for long-term data storage. We propose that high energy physics data could be stored much more compactly if we use modern computational algorithms and compression approaches that take into account the fundamental characteristics of the data.

We study several modern compression algorithms and evaluate their performance on high energy physics data. We then evaluate a variety of techniques used in data compression to improve compression ratio: delta encoding, floating point representation, data aggregation, and dictionary optimizations. These algorithms and techniques exist in a tradeoff space where compression ratio, throughput, and resource utilization can be exchanged to find the best compression for a specific use. 
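
As a rough illustration of one technique named above, delta encoding, the sketch below stores differences between consecutive values before handing the bytes to a general-purpose compressor. It is not the implementation from the paper: the timestamp-like values are synthetic, the helper names are hypothetical, and zlib is only a stand-in for the algorithms the paper evaluates.

```python
import random
import struct
import zlib


def delta_encode(values):
    """Return the first value followed by successive differences."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by accumulating a running sum."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


def pack_int64(values):
    """Serialize integers as little-endian signed 64-bit values."""
    return struct.pack(f"<{len(values)}q", *values)


def compression_ratio(raw):
    """Uncompressed size divided by zlib-compressed size."""
    return len(raw) / len(zlib.compress(raw, 6))


# Synthetic, slowly increasing values (an assumption, not real ATLAS/CMS data).
rng = random.Random(0)
timestamps = []
t = 1_590_000_000_000
for _ in range(10_000):
    t += rng.randint(1, 50)
    timestamps.append(t)

plain_ratio = compression_ratio(pack_int64(timestamps))
delta_ratio = compression_ratio(pack_int64(delta_encode(timestamps)))

print(f"plain: {plain_ratio:.2f}x, delta-encoded: {delta_ratio:.2f}x")
```

On data like this, the differences are small and leave most bytes of each 64-bit value at zero, which is why the delta-encoded stream tends to compress better. The cost is an extra pass over the data on both write and read.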

Evaluation on real datasets from the ATLAS and CMS experiments shows that adopting algorithms designed for modern processors and larger memory sizes can provide compression ratio improvements of 7% while providing better compression and decompression throughput. Furthermore, applying techniques that take into account the underlying type of a block of data, not just the bytes of data, can increase compression ratio by an additional 5%. Overall, we find that an approach that prioritizes compression ratio can reduce the overall size of data files by more than 15%, providing a significant reduction in data storage requirements.

However, this solution is useful only if it is cost-effective. We analyze the cost of scaling up our compression strategies for the ATLAS experiment. We find that a production implementation of our approach would require fewer than 50 CPU cores to handle reading a petabyte of data per day. This approach could reduce data storage requirements by more than 8 petabytes, and save hundreds of thousands of dollars in hard drive and tape storage costs each year. Hence, our approach is cost-effective and feasible on a large scale.
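
To put the core count in perspective, the back-of-the-envelope calculation below converts "a petabyte per day on 50 cores" into a sustained per-core rate. It assumes a decimal petabyte (10^15 bytes) and perfectly even load across cores; neither assumption comes from the abstract, and the function name is hypothetical.

```python
PETABYTE = 10**15               # decimal petabyte, in bytes (assumption)
SECONDS_PER_DAY = 24 * 60 * 60


def per_core_throughput_mb_s(bytes_per_day, cores):
    """Sustained MB/s each core must handle, assuming an even split."""
    return bytes_per_day / SECONDS_PER_DAY / cores / 10**6


required = per_core_throughput_mb_s(PETABYTE, 50)
print(f"{required:.0f} MB/s per core")
```

Under these assumptions each core would need to sustain roughly 231 MB/s. Since the abstract says fewer than 50 cores suffice, the per-core rate of the actual implementation would have to be at least this high.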


Advisor: Prof. Andrew A. Chien
Committee Members: Prof. Raul Castro Fernandez, Prof. Rob Gardner (Physics)



Jessica Garza
Assistant Director of Undergraduate Studies
Department of Computer Science
The University of Chicago
Covid-19 Resources <https://cs.uchicago.edu/remote2020/>