[CS] Suhail Rehman Dissertation Defense/Feb 10, 2023

Megan Woodward meganwoodward at uchicago.edu
Mon Jan 30 14:43:08 CST 2023

This is an announcement of Suhail Rehman's Dissertation Defense.
Candidate: Suhail Rehman

Date: Friday, February 10, 2023

Time: 2 pm CST

Remote Location: https://uchicago.zoom.us/j/95536841017?pwd=eXc1VEZoV2ZuMDNNbkpUMGJlemlXQT09

Location: JCL 298

Title: Reconstructing the Lineage of Artifacts in Data Lakes

Abstract: The explosive growth of data-driven fields such as machine learning and data science has led to a proliferation of data and of systems, tools, and techniques to acquire, clean, process, prepare, curate, wrangle, and analyze it. This has led to the creation of data lakes -- large repositories of data often used as a central source of truth for data-driven applications. However, the lack of lineage information in data lakes can affect the quality of the data processed and the insights derived from it. Existing solutions for lineage involve manually annotating lineage information or capturing lineage as data is manipulated and transformed. Neither approach solves the problem of lineage and quality for data generated in the past, and this gap is often cited as an impediment to the overall vision of reducing the time to insight from vast amounts of data organized in a central data lake. This thesis proposes using similarity metrics to infer the lineage of data artifacts in data lakes. Using RELIC, we show the feasibility of recovering the lineage of data artifacts under varying assumptions about the availability of metadata. We then scale RELIC using sketching and indexing techniques and show that we can answer complex lineage queries accurately and efficiently with a suitable index structure. Finally, we introduce FUZZYDATA, a dataframe benchmarking system that can generate dataframe workflows of varying complexity using different dataframe clients to benchmark and test dataframe-based systems.

Advisor: Aaron Elmore

Committee Members: Aaron Elmore, Raul Castro Fernandez, and Michael Franklin
