[Colloquium] Ozan Gokdemir MS Presentation/May 9, 2024

Megan Woodward meganwoodward at uchicago.edu
Wed Apr 24 09:00:44 CDT 2024


This is an announcement of Ozan Gokdemir's MS Presentation
===============================================
Candidate: Ozan Gokdemir

Date: Thursday, May 09, 2024

Time: 12 pm CDT

Location: JCL 011

Title: Retrieval-Augmented Scientific Question Answering

Abstract: Staying abreast of the latest developments is crucial for scientists to advance their fields through new hypotheses. The rapid growth of scientific knowledge, however, makes it impossible for any individual scientist to process the full flow of information. The National Science Foundation reports that the number of science and engineering articles published in open-access journals has increased over 50-fold in the past two decades, from 19,000 in 2003 to 992,000 in 2022. At this scale, computational tools are needed to help scientists synthesize a wide array of recent findings in the literature into novel scientific hypotheses.

This thesis explores the application of Retrieval-Augmented Generation (RAG) to scientific question answering. RAG combines the strengths of information retrieval and language generation to produce contextually relevant and factually grounded answers to knowledge-intensive questions. The RAG model retrieves relevant information from a vast corpus of scientific literature, allowing it to answer questions that go beyond its training data. As a result, it offers a natural language interface to millions of scientific articles, shifting the workload of the modern scientist from manual literature review to creative hypothesis generation.

Our methodology involves parsing scientific articles from PDFs to raw text at an unprecedented scale. We then semantically chunk the text and encode the chunks into a vector database, both in a distributed fashion. The vector database is used for fast, parallelized neural retrieval, which provides excerpts relevant to a given user question. Finally, the generator leverages this information, along with the knowledge it obtained from pretraining, to answer the question. We evaluate our model on five scientific question-answering benchmarks and find that it outperforms GPT-4, answering over 90% of the questions correctly on the SciQ dataset. The findings suggest that retrieval-augmented generation holds promise as a tool for accelerating scientific discovery by assisting researchers in ingesting scientific literature and forming new hypotheses.
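
The chunk, encode, retrieve, and generate steps described in the abstract can be illustrated with a toy sketch. This is not the thesis implementation: fixed-size word windows stand in for semantic chunking, bag-of-words counts stand in for neural embeddings, an in-memory list stands in for the vector database, and the generator call is omitted (only the prompt it would receive is built). All function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_text(text, max_words=40):
    """Split text into fixed-size word windows (assumed stand-in for semantic chunking)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Bag-of-words term counts (assumed stand-in for a neural encoder)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents, max_words=40):
    """Chunk every document and store (chunk, vector) pairs in a list."""
    return [
        (chunk, embed(chunk))
        for doc in documents
        for chunk in chunk_text(doc, max_words)
    ]


def retrieve(index, question, k=2):
    """Return the k chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, excerpts):
    """Assemble the retrieved excerpts and the question into a generator prompt."""
    context = "\n".join(f"- {e}" for e in excerpts)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical two-document corpus, for illustration only.
documents = [
    "Mitochondria produce most of the ATP used by the cell.",
    "Chlorophyll absorbs light energy during photosynthesis in plants.",
]
index = build_index(documents)
question = "Which organelle produces ATP?"
excerpts = retrieve(index, question, k=1)
prompt = build_prompt(question, excerpts)
print(prompt)
```

In a real system the prompt would be passed to a language model, and the linear scan in retrieve would be replaced by an approximate nearest-neighbor search over the vector database.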

Advisor: Rick Stevens

Committee Members: Ian Foster, Rick Stevens, and Arvind Ramanathan