[Theory] In-Person Talk: 2/4 Talks at TTIC: Rowan Zellers, University of Washington

Mary Marre mmarre at ttic.edu
Mon Jan 31 13:36:26 CST 2022


*When:*        Friday, February 4th at *10:30am CT*

*Where:*       Talk will be given *live, in-person* at

                    TTIC, 6045 S. Kenwood Avenue
                    5th Floor, Room 530

*Virtually:*   Zoom Virtual Talk (*register in advance here
<https://uchicagogroup.zoom.us/webinar/register/WN_1KBwOci8S62CnOJ9hVGMKg>*)

*Who:*          Rowan Zellers, University of Washington



*Title:* Grounding Language by Seeing, Hearing, and Interacting


*Abstract:* As humans, our understanding of language is grounded in a rich
mental model of “how the world works,” which we learn through perception
and interaction. We use this understanding to reason beyond what is
literally said, imagining how situations might unfold in the world.
Machines today struggle to make such connections, which limits how safely
they can be used.



In my talk, I will discuss three lines of work that bridge this gap between
machines and humans. First, I will discuss how we might measure grounded
understanding, introducing a suite of approaches for constructing
benchmarks that use machines in the loop to filter out spurious biases. Next,
I will introduce PIGLeT: a model that learns physical commonsense
understanding by interacting with the world through simulation and uses this
knowledge to ground language. PIGLeT learns linguistic form and meaning –
together – and outperforms text-to-text-only models that are orders of
magnitude larger. Finally, I will introduce MERLOT, which learns about
situations in the world by watching millions of YouTube videos with
transcribed speech. The model learns to jointly represent video, audio, and
language, together and over time, learning multimodal neural script
knowledge representations.



Together, these directions suggest a path forward for building machines
that learn language rooted in the world.



*Bio:* Rowan Zellers is a final-year PhD candidate in Computer Science &
Engineering at the University of Washington, advised by Yejin Choi and Ali
Farhadi. His research focuses on enabling machines to understand language,
vision, sound, and the world beyond these modalities. He has been
recognized with an NSF Graduate Fellowship and a NeurIPS 2021 Outstanding
Paper Award, and his work has appeared in several media outlets, including
Wired, the Washington Post, and the New York Times. He graduated from
Harvey Mudd College with a B.S. in Computer Science & Mathematics and has
interned at the Allen Institute for AI.

*Host*: *Karen Livescu* <klivescu at ttic.edu>



Mary C. Marre
Faculty Administrative Support
*Toyota Technological Institute*
*6045 S. Kenwood Avenue*
*Chicago, IL  60637*
*mmarre at ttic.edu <mmarre at ttic.edu>*

