[Theory] [TTIC Talks] REMINDER: 3/10 Talks at TTIC: SouYoung Jin, Massachusetts Institute of Technology

Brandie Jones bjones at ttic.edu
Wed Mar 9 09:30:00 CST 2022


*When:*        Thursday, March 10th at *11am CT*


*Where:*       *Zoom Virtual Talk* (*register in advance here*
<https://uchicagogroup.zoom.us/webinar/register/WN_ABZkmKvrRmiHQsow5E7jmw>)


*Who:*          SouYoung Jin, Massachusetts Institute of Technology

*Title:*          Cross-Modal Learning for Video Understanding

*Abstract:*  Videos are good sources of knowledge about things we have not
yet experienced; they show many aspects of human life and carry multiple
sources of sensory information. Building a video understanding system
requires computer vision components, such as object detection and
recognition, as well as knowledge from other domains such as spoken/natural
language processing and cognitive science. Cross-modal learning is learning
that draws on information from more than one modality. In this talk, I will
introduce two recent projects on cross-modal learning for video
understanding. In particular, I will talk about the "Spoken Moments"
project, in which my collaborators and I collected spoken descriptions of
500K short videos to capture natural and concise descriptions at large
scale. We designed the study to collect only descriptions of events that
stood out in participants’ memory, as we were particularly interested in
the video content that human annotators pay attention to. Using pairs of
videos and spoken descriptions, we trained a model with a cross-modal
learning architecture to understand the video content, leading to more
human-like understanding. The model trained on Spoken Moments generalizes
strongly to other datasets. I will also present our approaches to model
training and future projects in video understanding.
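
[Editor's note: for readers unfamiliar with the setup, pairing each video
with its spoken caption is typically exploited through a contrastive
objective that pulls matched video/caption embeddings together in a shared
space. The sketch below is a generic, hypothetical illustration of such an
objective, not the Spoken Moments architecture; the encoder outputs,
dimensions, and temperature are placeholders.]

    # Hypothetical sketch of a generic cross-modal contrastive objective
    # (not the speaker's actual model); the embeddings stand in for the
    # outputs of separate video and caption encoders.
    import torch
    import torch.nn.functional as F

    def cross_modal_contrastive_loss(video_emb, caption_emb, temperature=0.07):
        """Symmetric InfoNCE loss over a batch of paired video/caption embeddings."""
        v = F.normalize(video_emb, dim=-1)    # (B, D) video embeddings
        c = F.normalize(caption_emb, dim=-1)  # (B, D) caption embeddings
        logits = v @ c.T / temperature        # (B, B) cosine-similarity matrix
        targets = torch.arange(v.size(0), device=v.device)  # matched pairs on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.T, targets))

    # Example with random stand-ins for encoder outputs of an 8-clip batch:
    loss = cross_modal_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))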

*Bio:*  SouYoung Jin is a postdoctoral associate at the Computer Science
and Artificial Intelligence Laboratory (CSAIL) at the Massachusetts
Institute of Technology (MIT), working with Dr. Aude Oliva. Previously,
she earned her PhD at the College of Information and Computer Sciences
(CICS), University of Massachusetts Amherst (UMass Amherst), where she
worked on improving face clustering in videos under Dr. Erik Learned-Miller
in the Computer Vision Lab. Her main research areas are Computer Vision,
Machine Learning, and Cognitive Science.

*Host:* Greg Shakhnarovich <greg at ttic.edu>



*Brandie Jones *
*Faculty Administrative Support*
Toyota Technological Institute
6045 S. Kenwood Avenue
Chicago, IL  60637
www.ttic.edu

