<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">

</head>

<body>

<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">

<div class="PlainText">This is an announcement of Suhail Rehman's Dissertation Defense.<br>

===============================================<br>

Candidate: Suhail Rehman<br>

<br>

Date: Friday, February 10, 2023<br>

<br>

Time: 2 pm CST<br>

<br>

Remote Location: <a href="https://uchicago.zoom.us/j/95536841017?pwd=eXc1VEZoV2ZuMDNNbkpUMGJlemlXQT09">

https://uchicago.zoom.us/j/95536841017?pwd=eXc1VEZoV2ZuMDNNbkpUMGJlemlXQT09</a><br>

<br>

Location: JCL 298<br>

<br>

Title: Reconstructing the Lineage of Artifacts in Data Lakes <br>

<br>

Abstract: The explosive growth of data-driven fields such as machine learning and data science has led to a proliferation of data, along with systems, tools, and techniques to acquire, clean, process, prepare, curate, wrangle, and analyze it. This has led to the creation of data lakes -- large repositories of data often used as a central source of truth for data-driven applications. However, the lack of lineage information in data lakes can affect both the quality of the data processed and the insights derived from it. Existing solutions for lineage involve manually annotating lineage information or capturing lineage as data is manipulated and transformed. Neither approach addresses lineage and quality for data generated in the past, a gap often cited as an impediment to the overall vision of reducing the time to insight from vast amounts of data organized in a central data lake. This thesis proposes using similarity metrics to infer the lineage of data artifacts in data lakes. Using RELIC, we show the feasibility of recovering the lineage of data artifacts under varying assumptions about the availability of metadata. We then scale RELIC using sketching and indexing techniques and show that we can answer complex lineage queries accurately and efficiently with a suitable index structure. Finally, we introduce FUZZYDATA, a dataframe benchmarking system that can generate dataframe workflows of varying complexity using different dataframe clients to benchmark and test dataframe-based systems.<br>

<br>

Advisor: Aaron Elmore<br>

<br>

Committee Members: Aaron Elmore, Raul Castro Fernandez, and Michael Franklin<br>

<br>

</div>

</span></font></div>


</body>

</html>