[Theory] NOW: [Talks at TTIC] 11/20 Young Researcher Seminar Series: Peter Hase, UNC

Wed Nov 20 10:55:00 CST 2024

*When:*    Wednesday, November 20th at 11AM CT

*Where:*   Talk will be given *live, in-person* at

                    TTIC, 6045 S. Kenwood Avenue

                    5th Floor, Room 530

*Virtually:* via Panopto (Livestream:
<https://uchicago.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=cb670e00-20e7-46a2-a6a4-b20b00f9f619>)

*Who:*      Peter Hase, University of North Carolina at Chapel Hill

*Title:*       AI Safety Through Interpretable and Controllable Language
Models

*Abstract:* In a 2022 survey, 37% of NLP experts agreed that "AI decisions
could cause nuclear-level catastrophe" in this century. This survey was
conducted prior to the release of ChatGPT. The research community’s
now-common concern about catastrophic risks from AI highlights that
long-standing problems in AI safety are as important as ever. In this talk,
I will describe research on two core problems at the intersection of NLP
and AI safety: (1) interpretability and (2) controllability. We need
interpretability methods to verify that models use acceptable and
generalizable reasoning to solve tasks. Controllability refers to our
ability to steer individual behaviors in models on demand, which is helpful
since pretrained models will need continual adjustment of specific
knowledge and beliefs about the world. This talk will cover recent work on
(1) open problems in interpretability, including mechanistic
interpretability and chain-of-thought faithfulness, (2) fundamental
problems with model editing, viewed through the lens of belief revision,
and (3) scalable oversight, with a focus on weak-to-strong generalization.
Together, these lines of research aim to develop rigorous technical
foundations for ensuring the safety of increasingly capable AI systems.

*Bio:* Peter Hase is an AI Resident at Anthropic, working on the Alignment
Science team. He recently completed his PhD at the University of North
Carolina at Chapel Hill. His research focuses on NLP and AI Safety, with a
particular emphasis on techniques for explaining and controlling model
behavior. He has previously worked at AI2, Google, and Meta.

*Host: Karen Livescu <klivescu at ttic.edu>*

-- 
*Brandie Jones*
*Executive Administrative Assistant*
Toyota Technological Institute
6045 S. Kenwood Avenue
Chicago, IL  60637
www.ttic.edu