<html>


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


</head>


<body style="overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">


<div style="font-family: Roboto, Helvetica, Arial, sans-serif;"><i>Department of Computer Science Seminar</i></div>


<div style="font-family: Roboto, Helvetica, Arial, sans-serif;"><i><br>


</i></div>


<div style="font-family: Roboto, Helvetica, Arial, sans-serif;"><b>Yuanhao Wang</b></div>


<div style="font-family: Roboto, Helvetica, Arial, sans-serif;"><b>PhD Student</b></div>


<div style="font-family: Roboto, Helvetica, Arial, sans-serif;"><b>Princeton University</b></div>


<div style="font-family: Roboto, Helvetica, Arial, sans-serif;"><b><br>


</b></div>


<div style="font-family: Roboto, Helvetica, Arial, sans-serif;"><b>Wednesday, October 25th</b></div>


<div style="font-family: Roboto, Helvetica, Arial, sans-serif;"><b>11:00am - 12:00pm</b></div>


<div style="font-family: Roboto, Helvetica, Arial, sans-serif;"><b>In Person: John Crerar Library 390</b></div>


<div style="font-family: Roboto, Helvetica, Arial, sans-serif;"><b><br>


</b></div>


<div style="font-family: Roboto, Helvetica, Arial, sans-serif;"><b>Title: Is RLHF more difficult than standard RL? A view from reductions</b></div>


<div style="font-family: Roboto, Helvetica, Arial, sans-serif;"><b><br>


</b></div>


<div style="font-family: Roboto, Helvetica, Arial, sans-serif;"><b>Abstract:</b></div>


<div style="font-family: Roboto, Helvetica, Arial, sans-serif;">Reinforcement Learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) directly learns from reward signals. Preferences arguably contain less


 information than rewards, which makes preference-based RL seemingly more difficult. This paper theoretically proves that, for a wide range of preference models, we can solve preference-based RL directly using existing algorithms and techniques for reward-based


 RL, with small or no extra costs. Specifically, (1) for preferences that are drawn from reward-based probabilistic models, we reduce the problem to robust reward-based RL that can tolerate small errors in rewards; (2) for general arbitrary preferences where


 the objective is to find the von Neumann winner, we reduce the problem to multiagent reward-based RL that finds Nash equilibria for factored Markov games under a restricted set of policies. The latter case can be further reduced to an adversarial MDP when preferences


 only depend on the final state. We instantiate all reward-based RL subroutines by concrete provable algorithms, and apply our theory to a large class of models including tabular MDPs and MDPs with generic function approximation. We further provide guarantees


 when K-wise comparisons are available.<b><br>


</b></div>


<div style="font-family: Roboto, Helvetica, Arial, sans-serif;"><b><br>


</b></div>


<div style="font-family: Roboto, Helvetica, Arial, sans-serif;"><b>Bio:</b></div>


<div style="font-family: Roboto, Helvetica, Arial, sans-serif;">Yuanhao Wang is a fourth-year PhD student in the Department of Computer Science at Princeton University, where he is advised by Chi Jin. Prior to Princeton, he received his bachelor’s degree in Computer


 Science from Yao Class at Tsinghua University. His research interests include reinforcement learning theory, learning in games, and minimax optimization. He received the Best Paper Award at the ICLR 2022 Workshop on Gamification and Multiagent Solutions.<b><br>


</b></div>


<div style="font-family: Roboto, Helvetica, Arial, sans-serif;"><br>


</div>


<div style="font-family: Roboto, Helvetica, Arial, sans-serif;"><br>


</div>


<img alt="profile.jpeg" src="cid:DB35AC1F-C3C8-464B-87D5-AA727F5FD48B">


<div><br>


</div>


<div>---<br class="Apple-interchange-newline">


<div>


<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">


<div>Holly Santos<br>


Executive Assistant to Hank Hoffmann, Chairman<br>


Department of Computer Science<br>


The University of Chicago<br>


5730 S Ellis Ave-217   Chicago, IL 60637<br>


P: 773-834-8977<br>


hsantos@uchicago.edu</div>


<div><br>


</div>


</div>




</div>


<br>


</div>


</body>


</html>