[CS] Lalchand Pandia MS Presentation/Oct 14, 2025
via cs
cs at mailman.cs.uchicago.edu
Tue Oct 7 09:19:48 CDT 2025
This is an announcement of Lalchand Pandia's MS Presentation
===============================================
Candidate: Lalchand Pandia
Date: Tuesday, October 14, 2025
Time: 9 am CST
Remote Location: https://uchicago.zoom.us/j/92819186367?pwd=3EDiCw26TaiRy5oXecm1W0zHnKDWJY.1
Meeting ID: 928 1918 6367
Passcode: 968269
Location: JCL 298
Title: RETHINKING DATA SELECTION: THE IMPORTANCE OF COVERAGE OVER DIFFICULTY IN GENERATIVE FINETUNING
Abstract: Selecting high-quality training data can reduce the computational cost of LLM fine-tuning. Prior data selection methods have developed a variety of scores that aim to reflect what kind of information a data instance can provide to the model, in order to subselect instances for fine-tuning, and a majority of this prior work has focused on scores quantifying difficulty. The intuition in such work is that difficult examples are more informative and can therefore lead to more efficient fine-tuning. While data selection based on difficulty has shown promise for smaller classification models, in this work we find that such scores are ineffective for fine-tuning LLMs on generative tasks, because their narrow focus on “difficult” instances fails to capture the necessary diversity of the input data. We find that in generative tasks, such approaches always fall behind random selection, which our analysis reveals is more representative of the underlying input space (i.e., has better coverage). Motivated by this, we propose a simple clustering-based selection method that selects data more representative of the underlying input distribution, enabling selection of smaller subsets of training data for generative tasks. Using a case study on Llama 3 8B (Grattafiori et al., 2024) and OLMo 2 7B (OLMo et al., 2025), we find that the coverage-based approach performs well above difficulty scoring, yielding performance at or above that of random selection across a set of generative tasks.
Advisor: Allyson Ettinger
Committee Members: Allyson Ettinger, Michael Maire, and Yuxin Chen