<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">

<meta name="Generator" content="Microsoft Word 15 (filtered medium)">

<style><!--

/* Font Definitions */

@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Aptos;

        panose-1:2 11 0 4 2 2 2 2 2 4;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0in;

        font-size:11.0pt;

        font-family:"Aptos",sans-serif;

        mso-ligatures:standardcontextual;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-size:11.0pt;

        mso-ligatures:none;}

@page WordSection1

        {size:8.5in 11.0in;

        margin:1.0in 1.0in 1.0in 1.0in;}

div.WordSection1

        {page:WordSection1;}

--></style>

</head>

<body lang="EN-US" link="#467886" vlink="#96607D" style="word-wrap:break-word">

<div class="WordSection1">

<p class="MsoNormal">This is an announcement of Qiming Wang's Dissertation Defense</p>

<p class="MsoNormal">===============================================</p>

<p class="MsoNormal"><b>Candidate:</b> Qiming Wang</p>

<p class="MsoNormal"> </p>

<p class="MsoNormal"><b>Date:</b> Thursday, October 3rd, 2024</p>

<p class="MsoNormal"> </p>

<p class="MsoNormal"><b>Time:</b>  2pm - 4pm CT</p>

<p class="MsoNormal"> </p>

<p class="MsoNormal"><b>Location: </b>  JCL 236</p>

<p class="MsoNormal"> </p>

<p class="MsoNormal"><b>Title:</b>  Tabular Data Extraction and Discovery Using Natural Language Questions</p>

<p class="MsoNormal"> </p>

<p class="MsoNormal"><b>Abstract:</b> Tabular data extraction and discovery from a large corpus are two long-standing challenges in the data management community. Traditional solutions require substantial human effort to write rules or annotate training data, and this expensive manual work has to be repeated for each new domain of the source corpus, which makes these solutions hard to scale. Can we avoid this repetitive, expensive manual work and still maintain comparable performance across different datasets? In this dissertation, we show that we can do both. 1) By reducing table extraction from a large text corpus to the task of question answering over that corpus, it is possible to build a table extraction system that generalizes to other domains once trained, so that it avoids repeating the manual work of collecting training data for new table domains. 2) Given any table corpus, by learning from the corpus itself with the help of a large language model, we can build a table discovery system that matches the query quality of systems trained on human-annotated data. Specifically, we build three systems/tools to demonstrate that these approaches work in practice: 1) FabricQA-Extractor, a system to extract tables from a text corpus using natural language questions; 2) SOLO, a self-supervised system for table discovery using natural language questions; 3) Pneuma-Benchmark, an automatic tool to evaluate models/systems for table discovery using natural language questions. Together, these works contribute a solution to tabular data extraction and discovery that requires little human effort.</p>

<p class="MsoNormal"> </p>

<p class="MsoNormal"><b>Advisor:</b>  Raul Castro Fernandez</p>

<p class="MsoNormal"><b> </b></p>

<p class="MsoNormal"><b>Committee Members:</b> Raul Castro Fernandez, Kyle Chard, Sanjay Krishnan</p>

<p class="MsoNormal"> </p>


</div>

</body>

</html>