[Theory] NOW: 9/7 Thesis Defense: Falcon Dai, TTIC

Mary Marre mmarre at ttic.edu
Wed Sep 7 12:33:21 CDT 2022


*Thesis Defense: Falcon Dai, TTIC*

When: Wednesday, September 7th from *12:30 - 2:30 pm CT*

Virtually: *Join Virtually Here*
<https://uchicago.zoom.us/j/98534120153?pwd=SmRDMFo1UTA1M3pNZEZOblhkWG9yQT09>

Who: Falcon Dai, TTIC


Thesis Title: On Reward Structures of Markov Decision Processes

*Abstract*:
A Markov decision process can be parameterized by a transition kernel
and a reward function. Both play essential roles in the study of
reinforcement learning, as evidenced by their presence in the Bellman
equations. In my inquiry into the various kinds of "costs" associated
with reinforcement learning, inspired by the demands of robotic
applications, I discovered that rewards are central to understanding
the structure of a Markov decision process and that reward-centric
notions can elucidate important concepts in reinforcement learning.
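
To make the parameterization concrete, here is a minimal illustrative
sketch (not from the thesis; the names and array shapes are my own
assumptions): for a finite MDP with transition kernel P, reward
function r, and a fixed policy pi, the Bellman evaluation equation
v = r_pi + gamma * P_pi v can be solved directly for the state values.

    import numpy as np

    def evaluate_policy(P, r, pi, gamma=0.95):
        # P: (S, A, S) transition kernel, r: (S, A) rewards,
        # pi: (S, A) policy; all assumed to be NumPy arrays.
        S = P.shape[0]
        P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state dynamics under pi
        r_pi = np.einsum("sa,sa->s", pi, r)    # expected one-step reward under pi
        # Solve (I - gamma * P_pi) v = r_pi for the state values.
        return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)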

Specifically, I studied the sample complexity of policy evaluation and
developed a novel estimator with an instance-specific error bound of
$\widetilde{O}(\sqrt{\nicefrac{\tau_s}{n}})$ for estimating a single
state value. In the online regret minimization setting, I refined a
transition-based MDP constant, the diameter, into a reward-based
constant, the maximum expected hitting cost, and with it provided a
theoretical explanation for how a well-known technique,
potential-based reward shaping, can accelerate learning with expert
knowledge. To study safe reinforcement learning, I modeled hazardous
environments with irrecoverability and proposed a quantitative notion
of safe learning via reset efficiency; in this setting, I modified a
classic algorithm to account for resets, achieving promising
preliminary numerical results. Lastly, for MDPs with multiple reward
functions, I developed a planning algorithm that efficiently computes
Pareto-optimal stochastic policies.
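
For readers unfamiliar with the shaping technique mentioned above, a
minimal sketch (illustrative only; the function names are hypothetical,
not from the thesis): potential-based reward shaping replaces the
reward r(s, a, s') with r(s, a, s') + gamma * Phi(s') - Phi(s) for an
arbitrary state potential Phi, a transformation known to preserve the
set of optimal policies (Ng, Harada & Russell, 1999).

    def shaped_reward(r, phi, s, a, s_next, gamma=0.95):
        # Potential-based shaping: add gamma * Phi(s') - Phi(s) to the
        # raw reward r(s, a, s'); r and phi are callables here.
        return r(s, a, s_next) + gamma * phi(s_next) - phi(s)

    # E.g., choosing phi close to the optimal value function yields
    # denser feedback without changing which policies are optimal.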

*Thesis Advisor*: *Matthew Walter* <mwalter at ttic.edu>


Mary C. Marre
Faculty Administrative Support
*Toyota Technological Institute*
*6045 S. Kenwood Avenue*
*Chicago, IL  60637*
*mmarre at ttic.edu*

