[Colloquium] Hao Jiang Dissertation Announcement

n-yack at uchicago.edu
Mon Aug 2 10:32:50 CDT 2021


This is an announcement of Hao Jiang's Dissertation Defense
===============================================
Date: Thursday, August 12, 2021

Time: 11 am CDT

Location:  https://uchicago.zoom.us/j/91654052239?pwd=QU5BNjRaY0lZazFEOWJIQTI4VEg2dz09
Password: 048616

Ph.D. Candidate: Hao Jiang

Title: Efficient Lossless Compression in and beyond Columnar Databases

Abstract: Columnar databases have dominated the data analysis market thanks to their superior query-processing performance on big data. However, large data volumes also bring challenges for data storage and transfer. While people often rely on lossless compression techniques to address this problem, database researchers tended to overlook compression in the pre-columnar-database era. There are two primary reasons. First, few compression algorithms were available: byte-oriented compression algorithms such as Gzip were the de facto only choice. Second, Gzip-like algorithms have a significant impact on query performance. The rise of columnar databases changes this situation. Storing data in separate columns enables the application of more compression algorithms, each designed for a single data type. With these changes, new challenges also arise. This dissertation addresses three challenges of lossless compression in columnar databases: better encoding algorithms, faster queries on encoded data, and selecting proper encoding algorithms for data columns. We present PIDS, a novel compression approach for string columns; SBoost, a C++ library for fast queries on encoded data; and CodecDB, an encoding-aware database with data-driven encoding selection. We show that these innovations allow columnar databases to outperform competitors by orders of magnitude in storage efficiency and query speed, demonstrating the great potential of compression techniques in columnar databases. Having seen the gains brought by combining columnar layout and compression techniques, we further expand the design space and explore the potential of using columnar layout and compression to accelerate other data structures. As a result of this exploration, we present CoLoM, a key-value store that uses columnar layout and compression to improve LSM-tree efficiency.

Advisor: Aaron Elmore

Committee Members: Aaron Elmore, Sanjay Krishnan, and Raul Castro Fernandez


