[CS] REMINDER: Yuanjian Liu Candidacy Exam/Dec 5, 2024

via cs cs at mailman.cs.uchicago.edu
Wed Dec 4 10:55:18 CST 2024


This is an announcement of Yuanjian Liu's Candidacy Exam.
===============================================
Candidate: Yuanjian Liu

Date: Thursday, December 05, 2024

Time: 12:30 pm CST

Remote Location: https://uchicago.zoom.us/j/6019256463?pwd=SENUMzJEZEJDMVhMaHZiVDI2V09qdz09

Location: JCL 356

Title: Can lossy compression algorithms confidently optimize data transfer without quality issues?

Abstract: Scientific applications of all kinds produce data at an ever-increasing rate, creating a need for efficient storage and transfer. Compression is the standard remedy, but the available algorithms involve trade-offs. Lossless compressors preserve full data fidelity, yet the storage reduction is modest (1.5x-2x). Lossy compressors, by contrast, can reduce data sizes dramatically (10x-1000x+) under configurable error thresholds. In theory, a lossy compressor can reach an arbitrarily high compression ratio if the user accepts a high enough error bound, but beyond some point the data become unusable. Because that point is hard to identify, scientists hesitate to adopt lossy compression, fearing that it may introduce inaccuracies into downstream analysis. How to obtain both the high compression ratios of lossy compression and certainty about data usability remains an open question. In this thesis, my goal is to provide an optimized data-transfer solution in which lossy compression can be used with confidence. I explore ways to apply lossy compression to scientific data while respecting required constraints and data fidelity. I first design lossy compression algorithms that enforce multiple error bounds over different value ranges and regions, ensuring that the reconstructed data meet stated constraints. To give users a sense of the decompressed data in advance, I explore methods that predict compression performance (compression ratio, time, peak signal-to-noise ratio, etc.) with minimal overhead. To help users select the most suitable compressor for a given dataset and gain confidence in the compressed data quality, I explore an interactive compression paradigm in which users can visually preview a selected portion of the raw data alongside the decompressed output of multiple compressors. For extremely large datasets, I explore ways to use multiple compute nodes to compress a single large file collaboratively without exceeding memory constraints. I also explore compression methods for special data types, such as genome sequence data, that error-bounded compression algorithms handle poorly. Finally, I combine all the proposed algorithms and mechanisms into an intelligent and scalable system, GlobaZip, to be shared with the scientific community.
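For readers unfamiliar with error-bounded lossy compression, the toy sketch below illustrates the idea the abstract builds on. It is not the candidate's algorithms or the GlobaZip system; every name and parameter here is an illustrative assumption. It quantizes values to a user-chosen absolute error bound and then applies a generic lossless coder (zlib), demonstrating two points from the abstract: the reconstruction error never exceeds the bound, and looser bounds yield higher compression ratios, measured here alongside PSNR.

    # Toy error-bounded lossy compressor (illustrative sketch only):
    # uniform scalar quantization to an absolute error bound, followed
    # by lossless compression of the integer bins with zlib.
    import zlib
    import numpy as np

    def compress(data: np.ndarray, abs_error: float) -> bytes:
        # Map each value to an integer bin of width 2*abs_error; the
        # dequantized value is then within abs_error of the original.
        bins = np.round(data / (2.0 * abs_error)).astype(np.int64)
        return zlib.compress(bins.tobytes())

    def decompress(blob: bytes, abs_error: float, shape) -> np.ndarray:
        bins = np.frombuffer(zlib.decompress(blob), dtype=np.int64)
        return bins.reshape(shape) * (2.0 * abs_error)

    rng = np.random.default_rng(0)
    field = np.cumsum(rng.normal(size=(512, 512)), axis=1)  # smooth synthetic field

    for bound in (1e-3, 1e-1):
        blob = compress(field, bound)
        recon = decompress(blob, bound, field.shape)
        ratio = field.nbytes / len(blob)
        psnr = 10 * np.log10(np.ptp(field) ** 2 / np.mean((field - recon) ** 2))
        max_err = np.max(np.abs(field - recon))
        print(f"bound={bound:g}  ratio={ratio:.1f}x  PSNR={psnr:.1f} dB  "
              f"max_err={max_err:.2e}")

The thesis work begins where this sketch ends: choosing bounds that downstream analyses can tolerate, predicting the resulting ratio and PSNR without compressing the whole dataset, and scaling compression across multiple nodes.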

Advisors: Ian Foster and Kyle Chard

Committee Members: Sheng Di, Kyle Chard, Ian Foster


