[Colloquium] Reminder: Jiang/MS Presentation/Jun 5, 2018

Mon Jun 4 11:28:38 CDT 2018

This is a reminder about Hao Jiang's MS Presentation tomorrow.

------------------------------------------------------------------------------
Date:  Tuesday, June 5, 2018

Time:  2:30 PM

Place:  Eckhart 117

M.S. Candidate:  Hao Jiang

M.S. Paper Title: Optimizing Lightweight Encoding In Columnar Store

Abstract:
In columnar databases, data is generally stored in an encoded format
to save storage space and reduce I/O. Columnar encoding is a family of
encoding methods that reduce the storage size of an attribute while still
enabling efficient in situ data processing. Popular encoding schemes
include dictionary encoding, delta encoding, run-length encoding, and
bit-packed encoding. In this thesis, we propose methods to optimize
columnar encoding for both space and time efficiency.

Selecting the right encoding for an attribute is critical for
ensuring good compression; however, prior work and open-source systems
rely on static rules based on global knowledge of the dataset or
simplistic rules based on the data types. We evaluate the impact and
selection of encoding by studying a popular open-source columnar
storage framework, Parquet. We highlight how encoding implementation
differences lead to challenges in selecting the ideal encoding,
explore a data-driven method to select encoding schemes for a given
dataset, and evaluate various encoding schemes on a large corpus of
public datasets. We also examine decomposing attributes into
sub-attributes to enable better compression. This evaluation
highlights shortcomings with existing techniques and shows promising
directions for efficient columnar storage systems.

In many columnar data store implementations, performing queries on
encoded data requires the data to be first decoded to memory, which is
time-consuming. We design several novel SIMD-based algorithms to speed
up query execution on encoded data. Our algorithms use SIMD to
vectorize the execution and skip unnecessary decoding for higher
efficiency, achieving a filtering throughput of up to 18 billion
numbers per second with a single thread. We build SBoost, a columnar
data store utilizing these algorithms to speed up filtering on encoded
data, thus improving query efficiency. SBoost is written in Java and
invokes the SIMD algorithms using JNI, making it readily available for
Java-based query platforms, which are dominant in open-source data
analytic systems. SBoost demonstrates great potential for improving
query performance in both disk-based analytic queries and in-memory
queries, reducing query time by up to 90% compared to Apache
Parquet.

Hao's advisor is Prof. Aaron Elmore.

Log in to the Computer Science Department website for details:
 https://www.cs.uchicago.edu/phd/ms_announcements#hajiang

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Margaret P. Jaffey            margaret at cs.uchicago.edu
Department of Computer Science
Student Support Rep (Ry 156)               (773) 702-6011
The University of Chicago      http://www.cs.uchicago.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=