[Theory] 5/4 Thesis Defense: Shane Settle, TTIC

Mary Marre mmarre at ttic.edu
Tue Apr 25 19:08:18 CDT 2023


*When*:    Thursday, May 4th from *9:30 - 11:30 am CT*

*Where*:  Talk will be given *live, in-person* at
              TTIC, 6045 S. Kenwood Avenue
              5th Floor, *Room 529*

*Virtually*: attend *here
<https://uchicagogroup.zoom.us/j/99583225252?pwd=eDRnc040b2t1eHBUQ3kxM2I1SUZwZz09>*

*Who*:      Shane Settle, TTIC

------------------------------
*Title:*      Neural Approaches to Spoken Content Embedding

*Abstract:* Learning to compare spoken segments is a central operation in
speech processing. Traditional approaches in this area have favored
frame-level dynamic programming algorithms, such as dynamic time warping,
because they require no supervision, but they are limited in both
performance and efficiency. As an alternative, acoustic word
embeddings—fixed-dimensional vector representations of variable-length
spoken word segments—have begun to be considered for such tasks as well.
These embeddings can be learned discriminatively so that they are similar
for speech segments corresponding to the same word and dissimilar for
segments corresponding to different words. Acoustic word embedding models
also speed up segment comparison, which reduces to a dot product between
segment embedding vectors. However, the space of such discriminative
embedding models and training approaches explored so far, and their
application to real-world downstream tasks, remains limited.
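
As a rough illustration of the efficiency contrast above, here is a minimal
sketch (not code from the thesis; all names, shapes, and features are
assumed) comparing a frame-level DTW alignment with a single dot product
between fixed-dimensional segment embeddings:

import numpy as np

# Frame-level comparison: dynamic time warping over two feature
# sequences a (Ta x d) and b (Tb x d); O(Ta * Tb) per segment pair.
def dtw_cost(a, b):
    Ta, Tb = len(a), len(b)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            frame_dist = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = frame_dist + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[Ta, Tb]

# Embedding-based comparison: each variable-length segment is first
# mapped to one fixed-dimensional vector, so a pair costs only O(d).
def embedding_similarity(e1, e2):
    return np.dot(e1, e2)  # higher = more similar (unit-norm embeddings assumed)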

We start by considering “single-view” training losses where the goal is to
learn an acoustic word embedding model that separates same-word and
different-word spoken segment pairs. Then, we consider “multi-view”
contrastive losses. In this setting, acoustic word embeddings are learned
jointly with embeddings of character sequences to generate acoustically
grounded embeddings of written words, or acoustically grounded word
embeddings; such embeddings have been used to improve speech retrieval,
recognition, and spoken term discovery.
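
As a schematic of the two loss families just described (a sketch under
assumed notation, not the thesis objectives; the margin, similarity
function, and negative sampling are all simplified):

import torch
import torch.nn.functional as F

# Single-view: separate same-word from different-word segment pairs.
def single_view_loss(anchor, same_word, diff_word, margin=0.5):
    sim_pos = F.cosine_similarity(anchor, same_word)
    sim_neg = F.cosine_similarity(anchor, diff_word)
    return F.relu(margin + sim_neg - sim_pos).mean()

# Multi-view: embed spoken segments jointly with character sequences of
# the same word, yielding acoustically grounded word embeddings.
def multi_view_loss(acoustic_emb, written_emb, other_written_emb, margin=0.5):
    sim_pos = F.cosine_similarity(acoustic_emb, written_emb)
    sim_neg = F.cosine_similarity(acoustic_emb, other_written_emb)
    return F.relu(margin + sim_neg - sim_pos).mean()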

In this thesis, we present new discriminative acoustic word embedding (AWE)
and acoustically grounded word embedding (AGWE) approaches based on
recurrent neural networks (RNNs), extend them for multilingual training and
evaluation, and apply them in downstream tasks across a variety of resource
levels.
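
For concreteness, a generic example of what an RNN-based acoustic word
embedding model can look like (an assumed architecture for illustration,
not the specific models defended in the thesis):

import torch
import torch.nn as nn
import torch.nn.functional as F

class AcousticWordEmbedder(nn.Module):
    def __init__(self, feat_dim=40, hidden_dim=256, emb_dim=128):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, emb_dim)

    def forward(self, frames):
        # frames: (batch, time, feat_dim) acoustic features
        _, (h, _) = self.rnn(frames)
        # concatenate the final layer's forward and backward states
        h = torch.cat([h[-2], h[-1]], dim=-1)
        return F.normalize(self.proj(h), dim=-1)  # unit-norm embedding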

*Thesis Advisor*: *Karen Livescu* <klivescu at ttic.edu>

Mary C. Marre
Faculty Administrative Support
*Toyota Technological Institute*
*6045 S. Kenwood Avenue, Rm 517*
*Chicago, IL  60637*
*773-834-1757*
*mmarre at ttic.edu <mmarre at ttic.edu>*