[Colloquium] REMINDER: 3/1 Talks at TTIC: Ofir Press, University of Washington

Mary Marre mmarre at ttic.edu
Tue Feb 28 16:03:26 CST 2023


*When:*        Wednesday, March 1, 2023 at 11:30 am CT


*Where:*       Talk will be given *live, in-person* at

                   TTIC, 6045 S. Kenwood Avenue

                   5th Floor, Room 530


*Virtually:*   via Panopto (livestream:
<https://uchicago.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=cb224c62-d7b4-47bf-98cf-afb2010b249c>)


*Who:*         Ofir Press, University of Washington


------------------------------
*Title:*        Guidance Helps Where Scale Doesn't in Language Modeling

*Abstract:* Language models (LMs) are at the core of almost all state-of-the-art
natural language processing systems on almost every benchmark.
Recent papers, such as Brown et al. 2020 and Hoffmann et al. 2022, have
shown that scaling up the size of these models leads to better results. But
is scaling all we need in order to improve language models?

In this talk I argue that the answer is no, presenting three studies
that show properties of LMs that do not improve with scale. In addition,
I will show how to tackle these issues without increasing the LM's size
on disk, memory usage, or runtime. In each case, I accomplish this
by adding a new kind of guidance to the model.



In Press & Wolf 2017 we showed that the decoding mechanism in LMs contains
word representations, and that across models of different sizes the decoder
word representations are of lower quality than the ones in the encoder. We
then showed that by using the same representations twice (in both the encoder
and the decoder) we can improve LM performance while decreasing model size.
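
A minimal sketch of this weight-tying idea in a toy PyTorch language model is
shown below; the class and layer names are illustrative assumptions, not code
from the talk or the paper.

# Toy language model that shares one word-representation matrix between the
# input embedding ("encoder" side) and the output projection ("decoder" side).
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)              # input word representations
        self.rnn = nn.LSTM(d_model, d_model, batch_first=True)
        self.decoder = nn.Linear(d_model, vocab_size, bias=False)   # output projection
        # Tie the two matrices: the decoder reuses the embedding weights,
        # removing one vocab_size x d_model parameter matrix from the model.
        self.decoder.weight = self.embed.weight

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.decoder(hidden)  # logits over the vocabulary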



Memory constraints imply that LMs have to be trained on limited segments of
text. For example, GPT-3 (Brown et al. 2020) was trained on text segments
that are 2,048 tokens long. Can these models summarize text sequences that
are longer than the ones they observed at training time? Can they make code
predictions for code files that are longer than the ones they were shown
during training? In Press et al. 2021 we show that existing LMs cannot
process text segments that are longer than the ones they were trained on.
We present a new method (ALiBi) that allows LMs to efficiently consume
sequences that are longer than the ones they observed at training time. ALiBi
achieves this by guiding the LM to pay less attention to words that are
further away.
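
As an illustration, here is a minimal sketch of the kind of linearly decaying,
head-specific attention bias that ALiBi adds to attention scores; the geometric
slope schedule and tensor shapes are assumptions for this toy example rather
than a drop-in copy of the paper's code.

# Compute a (num_heads, seq_len, seq_len) bias that penalizes attention to
# distant (earlier) positions linearly, with a different slope per head.
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # Head-specific slopes form a geometric sequence (e.g. 1/2, 1/4, ..., 1/256 for 8 heads).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # Signed distance of key position j from query position i; future positions
    # (j > i) are clamped to 0 here and are masked out by the causal mask anyway.
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).clamp(max=0)
    # The bias is more negative the further away the key is, so the model
    # pays less attention to words that are further from the current position.
    return slopes[:, None, None] * distance[None, :, :]

# Usage sketch inside attention:
# scores = q @ k.transpose(-2, -1) / d**0.5 + alibi_bias(num_heads, seq_len)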



Finally, in Press et al. 2022 we show that LMs are able to reason over
facts observed during training to answer novel questions they have never
seen before. But in about 40% of cases, they are not able to carry out
basic reasoning over facts that they are able to recall, and this does not
improve with scale. We show that adding guidance to the way we prompt LMs,
by having them ask and answer sub-questions before answering the main
complex question, substantially improves their reasoning capabilities.
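
As an illustration of this kind of prompting guidance, the sketch below
prepends a single worked example in which the model asks and answers follow-up
questions before giving its final answer; the prompt wording and the
hypothetical `generate` function are assumptions for illustration, not the
exact prompt from the paper.

# One-shot prompt that demonstrates asking and answering sub-questions
# before committing to a final answer to the full compositional question.
SUBQUESTION_PROMPT = """Question: Who was president of the U.S. when superconductivity was discovered?
Are follow up questions needed here: Yes.
Follow up: When was superconductivity discovered?
Intermediate answer: Superconductivity was discovered in 1911.
Follow up: Who was president of the U.S. in 1911?
Intermediate answer: William Howard Taft was president in 1911.
So the final answer is: William Howard Taft.

Question: {question}
Are follow up questions needed here:"""

def answer_with_subquestions(question: str, generate) -> str:
    # `generate` stands in for any text-completion call to an LM (hypothetical).
    return generate(SUBQUESTION_PROMPT.format(question=question))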


These methods have been integrated into many state-of-the-art language and
translation models, including OpenAI's GPT, Google's BERT, BigScience's
BLOOM, and translation models from Microsoft, Meta, and Amazon.

*Bio:* I am a PhD candidate (ABD) at the Paul G. Allen School of Computer
Science & Engineering at the University of Washington, where I am very
fortunate to be advised by Noah Smith. During my PhD I spent two years as a
visiting researcher at Facebook AI Research on Luke Zettlemoyer's team,
where I mainly worked with Mike Lewis. Prior to that, in the summer of 2019
I interned at Facebook AI Research with Omer Levy. Towards the end of my
PhD I spent half a year as a visiting researcher at MosaicML on Jonathan
Frankle's team. Before starting my PhD I completed my Bachelor's and
Master's degrees in Computer Science at Tel Aviv University (where I was
advised by Lior Wolf and also worked with Jonathan Berant). Between my
Bachelor's and Master's degrees I was a software developer for a year.

*Host:* Karen Livescu <klivescu at ttic.edu>



Mary C. Marre
Faculty Administrative Support
*Toyota Technological Institute*
*6045 S. Kenwood Avenue, Rm 517*
*Chicago, IL  60637*
*773-834-1757*
*mmarre at ttic.edu <mmarre at ttic.edu>*

