[CS] Yue Gong Dissertation Defense/Jul 17, 2025
via cs
cs at mailman.cs.uchicago.edu
Fri Jul 11 09:51:42 CDT 2025
This is an announcement of Yue Gong's Dissertation Defense.
===============================================
Candidate: Yue Gong
Date: Thursday, July 17, 2025
Time: 10 am CDT
Remote Location: https://uchicago.zoom.us/j/2937745961?pwd=Q2J5eVVlYnRXcE1ZTW9Lcms5TmYxUT09
Location: JCL 298
Title: Navigating the Open World of Data: Proactive Systems for Data Discovery and Correlation Exploration
Abstract: Today’s data practitioners operate in an open world of data—a setting where data is vast, messy, and poorly documented. There is no complete catalog or consistent schema to rely on, and while this abundance offers great potential, it also makes it difficult to identify the right data to begin with. Practitioners are often overwhelmed, spending 80% of their time on data preparation and only 20% on actual analysis and insight generation. Prior systems have largely treated data discovery as a retrieval task: users must articulate precise queries, and the system returns matching results. But in real-world scenarios, users often lack schema knowledge, vocabulary, or even awareness of what data is available. As a result, the truly relevant results are frequently buried among many irrelevant ones. In such settings, it is not enough for systems to merely return results—they must actively guide users toward useful data, even when user intent is only partially specified. This dissertation introduces systems that embrace this proactive paradigm. Rather than simply responding to queries, they help users efficiently discover data and prioritize results based on relevance to their broader goals. These results include both individual datasets and relationships such as correlations. The systems guide users by (i) interacting with users to learn their preferences directly, (ii) leveraging the structure of the resulting data and the semantics of the tasks, and (iii) using large language models as proxies for human judgment. These methods are applied across a range of scenarios—including identifying relevant tabular datasets, surfacing correlations across space and time, and suggesting new hypotheses. By reducing the effort required to locate meaningful data, this work enables practitioners to spend less time searching and more time reasoning—accelerating progress in data science, scientific research, and decision-making.
Advisor: Raul Castro Fernandez
Committee Members: Raul Castro Fernandez, Sanjay Krishnan, Michael Franklin