[CS] Hao Jiang Candidacy Exam/Mar 25, 2021

pbaclawski at uchicago.edu
Wed Mar 24 09:42:10 CDT 2021


This is an announcement of Hao Jiang's Candidacy Exam.
===============================================
Date: Thursday, March 25, 2021

Time: 10:30 AM CDT

Location: Via Zoom

https://uchicago.zoom.us/j/91947008230?pwd=VWRiUnJFcC9KSjJURTBzdExlbkJ6UT09

Candidacy Candidate: Hao Jiang

Title: Efficient Lightweight Compression in and beyond Columnar Databases

Abstract: Columnar databases have come to dominate data analytics because of their superior query-processing performance on big data. The sheer volume of data also brings challenges to data storage and transfer. While people often rely on lossless compression techniques to address this problem, database researchers often overlooked compression in the pre-columnar-database era. There were two primary reasons. First, few compression algorithms were available: byte-oriented compression algorithms such as Gzip were the de facto only choice. Second, Gzip-like algorithms have a significant impact on query performance. The rise of columnar databases has changed this situation. Storing data in separate columns enables the application of more compression algorithms, each designed for a single data type. These algorithms also allow efficient decompression and fast queries. With these changes, new challenges also arise. In this proposal, we address three challenges of lossless compression in columnar databases: designing better encoding algorithms, running faster queries on encoded data, and selecting proper encoding algorithms for data columns. We present our research effort on each. We also explore the potential of using columnar layout and compression to accelerate other data structures. We present PIDS, a novel compression approach for string columns; SBoost, a C++ library for fast queries on encoded data; and CodecDB, an encoding-aware database with data-driven encoding selection. Experiments show that these improvements outperform competitors by orders of magnitude in both storage efficiency and query speed. Finally, we propose CoLSM, a key-value store that uses columnar layout and compression to improve LSM-tree efficiency.

Advisor: Aaron Elmore

Committee Members: Aaron Elmore, Sanjay Krishnan, and Raul Castro Fernandez


