[Theory] 8/23 Thesis Defense: Davis Yoshida, TTIC

Mary Marre mmarre at ttic.edu
Sat Aug 19 15:21:27 CDT 2023


*When*:    Wednesday, August 23, at 3 PM CDT

*Virtually*: The talk will be held via Zoom *here
<https://uchicago.zoom.us/j/99792875029?pwd=aHIwZERjTHhTeHQwTXhkOU92T2o2UT09>*

*Who*:       Davis Yoshida, TTIC

------------------------------
*Title:*       Making the Most of your Model: Methods for Finetuning and
Applying Pretrained Transformers

*Abstract:* Recent progress in NLP has been dominated by large pretrained
models. While the rate of improvement has been astounding, we are far from
knowing how to make optimal use of even the models we already have.
Improvements in our knowledge of how to use pretrained models have a
multiplicative benefit, because they increase the utility of all existing
pretrained models, including those trained for niche domains that are not
well covered even by state-of-the-art large language models (LLMs).

This thesis provides methods and analyses that make progress toward this
goal. The techniques outlined are task-agnostic and should provide a
benefit with nearly any transformer LM. We introduce two new finetuning
methods, each of which adds a new capability to the model it is applied
to. The first adds a recurrence mechanism, which removes the fixed
window-size constraint and improves the efficiency of a transformer
decoder. The second allows masked language models (MLMs) to be used to
initialize both the encoder and decoder of a non-autoregressive
sequence-to-sequence transformer, opening up generative applications for
models that were previously used only for natural language understanding
tasks.
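To give a feel for what "adding a recurrence mechanism" means here, the
sketch below shows the general idea of segment-level recurrence (in the
spirit of Transformer-XL): a long sequence is processed in fixed-size
windows, and each window attends to cached states from the previous
window, so the usable context is no longer limited to a single window.
The toy functions and window/cache sizes are illustrative assumptions,
not the method from the thesis.

# Illustrative sketch only: segment-level recurrence over fixed-size
# windows. `toy_hidden_states` and `toy_attend` stand in for real
# transformer-decoder computations.
import numpy as np

WINDOW, DIM = 4, 8
rng = np.random.default_rng(0)

def toy_hidden_states(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for a transformer layer: one vector per token."""
    return rng.normal(size=(len(tokens), DIM))

def toy_attend(query_states: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """Stand-in for attention over [cached memory; current window]."""
    context = np.concatenate([memory, query_states], axis=0)
    weights = query_states @ context.T
    weights = np.exp(weights - weights.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context

sequence = np.arange(20)                # a "long" token sequence
memory = np.zeros((0, DIM))             # recurrent cache, empty at first
for start in range(0, len(sequence), WINDOW):
    window = sequence[start:start + WINDOW]
    ctx_len = len(memory) + len(window)
    states = toy_hidden_states(window)
    states = toy_attend(states, memory)  # window sees cached past states
    memory = states                      # carry this window forward
    print(f"window {start // WINDOW}: effective context = {ctx_len} tokens")

The point of the loop is simply that the cache carried between iterations
lets each window condition on information from outside its own fixed span.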

We also introduce two new techniques that improve the predictions of any
transformer decoder without additional finetuning. The first, hidden state
optimization, improves prediction quality at inference time, especially
for few-shot classification. The second, conditional beam search, lets
practitioners search for NLG model outputs that have high likelihood while
conditioning on the event that the output is not degenerate (e.g., empty
or repetitive).
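The sketch below illustrates only the general idea behind conditioning the
search on a non-degenerate event: candidates that fall outside the event
are pruned, so the highest-likelihood surviving hypotheses are also
non-degenerate. The toy language model, the `is_degenerate` predicate, and
the pruning rule are all assumptions made for illustration, not the
algorithm from the thesis.

# Illustrative sketch only: beam search restricted to a "non-degenerate"
# event. The toy LM prefers repetition and early <eos>, which ordinary
# likelihood-maximizing search would exploit; pruning keeps such outputs
# out of the beam.
import math
from itertools import islice

VOCAB = ["<eos>", "the", "cat", "sat"]

def toy_log_prob(prefix: tuple, token: str) -> float:
    """Stand-in LM: mildly prefers repeating the last token and early <eos>."""
    if token == "<eos>":
        return math.log(0.4)
    if prefix and token == prefix[-1]:
        return math.log(0.3)
    return math.log(0.1)

def is_degenerate(tokens: tuple) -> bool:
    """Toy predicate: empty outputs or immediate repetition are degenerate."""
    words = [t for t in tokens if t != "<eos>"]
    return len(words) == 0 or any(a == b for a, b in zip(words, words[1:]))

def conditional_beam_search(beam_size: int = 3, max_len: int = 4):
    beams = [((), 0.0)]                       # (tokens, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok in VOCAB:
                new = tokens + (tok,)
                new_score = score + toy_log_prob(tokens, tok)
                if is_degenerate(new):        # prune paths outside the event
                    continue
                if tok == "<eos>":
                    finished.append((new, new_score))
                else:
                    candidates.append((new, new_score))
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = candidates[:beam_size]
        if not beams:
            break
    return sorted(finished, key=lambda x: x[1], reverse=True)

for tokens, score in islice(conditional_beam_search(), 3):
    print(" ".join(tokens), f"(log-prob {score:.2f})")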

Finally, we provide theoretical and empirical insights into the divergence
between model likelihood and output quality that has been widely observed
in prior work. These insights apply to any model that represents a
distribution over text, including language models that are neither
transformers nor autoregressive. We argue that the NLP community has, to
some extent, misunderstood the implications of these findings, and we
encourage a more nuanced point of view.

Taken together, the findings in this thesis should allow NLP practitioners
to make much more effective use of pretrained models, both those that
already exist and those that will be created in the future.

*Advisor*: Kevin Gimpel <kgimpel at ttic.edu>



Mary C. Marre
Faculty Administrative Support
*Toyota Technological Institute*
*6045 S. Kenwood Avenue, Rm 517*
*Chicago, IL  60637*
*773-834-1757*
*mmarre at ttic.edu <mmarre at ttic.edu>*