[CS] Suhail Rehman MS Presentation/May 19, 2021

pbaclawski at uchicago.edu
Thu May 6 12:11:30 CDT 2021


This is an announcement of Suhail Rehman's MS Presentation.
===============================================
Date: Wednesday, May 19, 2021

Time: 10:00 AM CDT

Location: via Zoom

https://uchicago.zoom.us/j/95576854504?pwd=eEhDeGcvV28vdDNodGhWNlZoR1JIZz09
Meeting ID: 955 7685 4504
Passcode: 152945

M.S. Candidate: Suhail Rehman

M.S. Paper Title: Sifting Through Data Relics: An Automated Framework for Retrospective Analysis of Data Artifacts

Abstract: Over the data science lifecycle, data scientists work with datasets using a variety of tools, including spreadsheets, computational notebooks, and ad-hoc scripts, for a range of tasks, including cleaning, integration, feature engineering, and visualization. This ad-hoc, heterogeneous process of data science typically results in multiple versions of the dataset(s) recorded as artifacts, operated on by various tools, even within a single data science workflow. Lineage information, including source datasets, data transformation programs or scripts, or manual annotations, is rarely captured, making it difficult to infer the relationships between artifacts in a given workflow retrospectively. We introduce the problem of retrospective lineage inference, wherein, given a collection of tabular artifacts, the goal is to reconstruct a lineage graph that resembles their true evolution in the data analysis workflows that generated them, aiding reproducibility, explainability, and long-term maintenance. Our technique for retrospective lineage inference, RELIC, differentiates between operations that keep row and column correspondences intact and those that do not; we use fine-grained similarity metrics to infer relationships for the former and targeted set containment-based detectors for the latter. RELIC can reconstruct lineage graphs from artifacts in a representative sample of real-world Jupyter notebooks with an average F1 score of ~0.91 without access to code, documentation, or other metadata.
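
To make the similarity-based half of the idea concrete, here is a minimal sketch of how a lineage graph might be guessed from tabular artifacts alone. It is not RELIC's implementation: the cell-level Jaccard score, the maximum-spanning-tree step, and the names cell_jaccard and infer_lineage are all assumptions chosen for illustration, and the containment-based detectors mentioned in the abstract are not covered.

```python
from itertools import combinations

def cell_jaccard(a, b):
    """Jaccard similarity over (column, value) pairs of two tables.

    Tables are lists of dicts; this metric is an illustrative assumption.
    """
    cells_a = {(col, val) for row in a for col, val in row.items()}
    cells_b = {(col, val) for row in b for col, val in row.items()}
    union = cells_a | cells_b
    return len(cells_a & cells_b) / len(union) if union else 0.0

def infer_lineage(artifacts):
    """Return the edges of a maximum spanning tree over pairwise similarity."""
    scored = sorted(
        ((cell_jaccard(artifacts[u], artifacts[v]), u, v)
         for u, v in combinations(sorted(artifacts), 2)),
        key=lambda t: -t[0],
    )
    parent = {name: name for name in artifacts}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = []
    for score, u, v in scored:
        ru, rv = find(u), find(v)
        if ru != rv:  # keep the edge only if it joins two components
            parent[ru] = rv
            edges.append((u, v, score))
    return edges

# Hypothetical artifacts: a raw table, a row filter of it, and a
# column projection of the filtered table.
raw = [
    {"id": 1, "city": "NYC", "temp": 70},
    {"id": 2, "city": "CHI", "temp": 60},
    {"id": 3, "city": "LA", "temp": 80},
]
filtered = [row for row in raw if row["temp"] >= 70]
projected = [{"id": r["id"], "city": r["city"]} for r in filtered]

edges = infer_lineage({"raw": raw, "filtered": filtered, "projected": projected})
for u, v, score in edges:
    print(f"{u} -- {v} (similarity {score:.2f})")
```

The resulting edges are undirected; deciding which artifact is the parent and which is the child would need additional signals that this sketch does not attempt.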

Advisor: Aaron Elmore

Committee Members: Aaron Elmore, Raul Castro Fernandez, and Michael Franklin



