[CS] REMINDER: David Reber MS Presentation Apr 3, 2025
via cs
cs at mailman.cs.uchicago.edu
Mon Mar 31 10:12:44 CDT 2025
This is an announcement of David Reber's MS Presentation
===============================================
Candidate: David Reber
Date: Thursday, April 03, 2025
Time: 2:30 pm CDT
Location: JCL 298
Title: RATE: Causal Explainability of Reward Models with Imperfect Counterfactuals
Abstract: Reward models are widely used as proxies for human preferences when aligning or evaluating LLMs. However, reward models are black boxes, and it is often unclear what exactly they are rewarding. We develop the Rewrite-based Attribute Treatment Estimator (RATE), an effective method for measuring the sensitivity of a reward model to high-level attributes of responses, such as sentiment, helpfulness, or complexity. Importantly, RATE measures the causal effect of an attribute on the reward. RATE uses LLMs to rewrite responses, producing imperfect counterfactual examples that can be used to measure causal effects. A key challenge is that these rewrites are imperfect in ways that can induce substantial bias in the estimated sensitivity of the reward model to the attribute. The core idea of RATE is to adjust for this imperfect-rewrite effect by rewriting twice. We establish the validity of the RATE procedure and show empirically that it is an effective estimator.
Advisors: Victor Veitch
Committee Members: Victor Veitch, Haifeng Xu, Ari Holtzman
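
The "rewriting twice" idea in the abstract can be illustrated with a toy sketch. The announcement does not give the estimator's exact form, so everything here is an assumption for illustration: the function names, the rewriter interface, and the specific contrast (rewrite-of-rewrite versus single rewrite) are hypothetical, and the reward model and rewriter are stand-ins rather than real LLM calls.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical interfaces; the announcement does not specify an API.
RewardFn = Callable[[str], float]
Rewriter = Callable[[str, bool], str]  # rewrite(text, target) sets the attribute to `target`


def naive_effect(responses: Sequence[str], reward: RewardFn, rewrite: Rewriter) -> float:
    """Single-rewrite estimate for responses that already have the attribute:
    compare each original with its attribute-removed rewrite."""
    return mean(reward(x) - reward(rewrite(x, False)) for x in responses)


def double_rewrite_effect(responses: Sequence[str], reward: RewardFn, rewrite: Rewriter) -> float:
    """Assumed double-rewrite estimate: rewrite the attribute off, then back on,
    and compare the two rewrites so both sides carry the rewriter's side effects."""
    diffs = []
    for x in responses:
        off = rewrite(x, False)        # attribute removed (plus rewrite artifacts)
        back_on = rewrite(off, True)   # attribute restored (same artifacts)
        diffs.append(reward(back_on) - reward(off))
    return mean(diffs)


# Toy stand-ins: the attribute is a trailing "!", and the rewriter has an
# unintended side effect (it lowercases the text).
def toy_rewrite(text: str, target: bool) -> str:
    base = text.rstrip("!").lower()
    return base + "!" if target else base


# Toy reward: +1.0 for the attribute, +0.5 for any capitalization.
def toy_reward(text: str) -> float:
    has_attribute = 1.0 if text.endswith("!") else 0.0
    has_capitals = 0.5 if text != text.lower() else 0.0
    return has_attribute + has_capitals


responses = ["Great job!", "Well done!"]
naive = naive_effect(responses, toy_reward, toy_rewrite)
adjusted = double_rewrite_effect(responses, toy_reward, toy_rewrite)
```

In this toy setup the single-rewrite comparison mixes the attribute's effect with the rewriter's lowercasing side effect, while the double-rewrite comparison cancels the side effect because both texts being compared have passed through the rewriter.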