[Theory] REMINDER: [Talks at TTIC] 5/15 TTIC Colloquium: Simon Du, University of Washington
Brandie Jones via Theory
theory at mailman.cs.uchicago.edu
Tue May 13 11:00:00 CDT 2025
*When:* Thursday, May 15th at *11am CT*
*Where:* Talk will be given *live, in-person* at
TTIC, 6045 S. Kenwood Avenue
5th Floor, *Room 529*
*Please note the room location has changed.*
*Virtually:* via Panopto (livestream
<https://uchicago.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=dc3e2b16-3fae-4961-8dc4-b2d800fb6856>)
*Who:* Simon Du, University of Washington
*Title:* Reinforcement Learning for Reasoning in Large Language Models with One Training Example
*Abstract:* We show that reinforcement learning with verifiable reward
using one training example (1-shot RLVR) is effective in incentivizing the
math reasoning capabilities of large language models (LLMs). Applying RLVR
to the base model Qwen2.5-Math-1.5B, we identify a single example that
elevates model performance on MATH500 from 36.0% to 73.6%, and improves the
average performance across six common mathematical reasoning benchmarks
from 17.6% to 35.7%. This result matches the performance obtained using the
1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which includes the
aforementioned example. Similar substantial improvements are observed
across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct,
DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different
math examples (many of which yield approximately 30% or greater improvement
on MATH500 when employed as a single training example).
In addition, we identify some interesting phenomena during 1-shot RLVR,
including cross-domain generalization, increased frequency of
self-reflection, and sustained test performance improvement even after the
training accuracy has saturated, a phenomenon we term post-saturation
generalization. Moreover, we verify that the effectiveness of 1-shot RLVR
primarily arises from the policy gradient loss, distinguishing it from the
"grokking" phenomenon. We also show the critical role of promoting
exploration (e.g., by adding entropy loss with an appropriate coefficient)
in 1-shot RLVR training. As a bonus, we observe that applying entropy loss
alone, without any outcome reward, significantly enhances
Qwen2.5-Math-1.5B's performance on MATH500 by 27.4%. These findings can
inspire future work on RLVR data efficiency and encourage a re-examination
of both recent progress and the underlying mechanisms in RLVR.
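For readers less familiar with the setup, a rough sketch of the kind of objective the abstract describes (an illustration assuming a standard RLVR-style policy gradient loss with an entropy bonus, not necessarily the exact loss used in this work) is

\[
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y)\, \log \pi_\theta(y \mid x) \,\big] \;-\; \beta\, \mathcal{H}\big(\pi_\theta(\cdot \mid x)\big),
\]

where $x$ is the single training prompt, $r(x, y)$ is the binary verifiable reward for a sampled response $y$, and $\beta$ is the entropy coefficient; the first term is the policy gradient loss the abstract identifies as the main driver of 1-shot RLVR, and the second is the entropy bonus that promotes exploration.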
*Host:* Zhiyuan Li <zhiyuanli at ttic.edu>
--
*Brandie Jones *
*Executive Administrative Assistant*
Toyota Technological Institute
6045 S. Kenwood Avenue
Chicago, IL 60637
www.ttic.edu