[Colloquium] * TIME CHANGE * for TTIC Talk: John DeNero

Julia MacGlashan macglashan at tti-c.org
Thu Apr 8 13:33:32 CDT 2010


*** TIME CHANGE ***

When:             Monday, Apr 12 @ 2:00 PM (formerly 11:00 AM)

Where:            TTIC Conference Room #526, 6045 S Kenwood Ave, 5th Floor

Who:              John DeNero, UC Berkeley

Title:            Large-Context Models for Large-Scale Machine Translation



 Statistical machine translation systems generate their output by stitching
together fragments of example translations.  Two trends are fueling rapid
progress in this field: more example data, and new modeling techniques that
better exploit the information in the data.  In particular, today's massive
data sets allow our statistical models to capture larger linguistic contexts
than ever before.  In this talk, I will give a tour of the three stages of a
modern system: training a model, searching for translations, and selecting
one.  For each stage, I will highlight innovations that have enabled us to
leverage the rich patterns contained in large data sets.

The first stage of translation discovers how two languages correspond to
each other.  Models of correspondence have historically bottomed out in
word-to-word statistics.  The approach I will describe centers instead on
statistics over multi-word phrases, which can capture idiomatic and
non-literal translation patterns.  These patterns are acquired automatically
using nonparametric statistical machinery that scales up naturally with the
data, introducing additional context whenever there is sufficient evidence
to support it.
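The talk describes a nonparametric model for learning phrase-level statistics; as a rough point of reference, the standard heuristic for harvesting multi-word phrase pairs from a word alignment can be sketched as follows (a toy illustration, not the model presented in the talk; all names are hypothetical):

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=3):
    """Toy sketch of consistent phrase-pair extraction: a source span and a
    target span form a phrase pair if no alignment link crosses the box."""
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions aligned to the source span [i1, i2].
            js = [j for (i, j) in alignment if i1 <= i <= i2]
            if not js:
                continue
            j1, j2 = min(js), max(js)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: every link into the target span stays inside the box.
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.append((tuple(src[i1:i2 + 1]), tuple(tgt[j1:j2 + 1])))
    return pairs
```

Multi-word pairs extracted this way can capture non-literal correspondences that no word-to-word table represents.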

The second stage searches for translations that are scored highly by a
model.  As our models grow in size and complexity with the data, so does the
scale of this search problem.  I will present a coarse-to-fine approach to
managing this complexity, which uses simpler approximate models to guide and
constrain the full-scale search.  This kind of multi-pass inference is
proving to be a powerful general tool for deploying language processing
systems at scale.
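The core idea of coarse-to-fine inference can be sketched in a few lines (a minimal illustration with hypothetical names, not the system described in the talk): a cheap coarse model prunes the candidate space so the expensive fine model only scores survivors.

```python
def coarse_to_fine_search(candidates, coarse_score, fine_score, keep=3):
    """Rank all candidates with a cheap coarse model, keep the top few,
    then rescore only those survivors with the expensive fine model."""
    survivors = sorted(candidates, key=coarse_score, reverse=True)[:keep]
    return max(survivors, key=fine_score)
```

In a real decoder the same idea is applied over search lattices rather than flat candidate lists, but the trade-off is identical: the fine model is evaluated far fewer times, at the risk of the coarse model pruning the true best hypothesis.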

The final stage selects a single output translation from a set of
high-scoring candidates.  The consensus framework I will introduce selects a
translation with high agreement among the multitude of strong candidates.
Theoretically, this approach unifies two distinct translation problems:
selecting final outputs and combining multiple systems together.
Empirically, this work has set new performance records for two of the
world's most successful large-scale, highly distributed translation systems.
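The consensus idea can be sketched as follows (an illustrative toy, assuming a hypothetical pairwise similarity function; the talk's actual framework is more sophisticated): rather than trusting the single highest-scoring candidate, pick the candidate that agrees most with the rest of the high-scoring set.

```python
def consensus_select(candidates, similarity):
    """Pick the candidate with the highest total agreement (similarity)
    with all the other candidates in the set."""
    def agreement(cand):
        return sum(similarity(cand, other)
                   for other in candidates if other is not cand)
    return max(candidates, key=agreement)


def unigram_overlap(a, b):
    """Crude similarity: number of word types the two strings share."""
    return len(set(a.split()) & set(b.split()))
```

Because the same selection rule applies whether the candidates come from one system or several, this framing naturally covers system combination as well.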

BIO:

John DeNero is a Ph.D. candidate in the computer science division at the
University of California, Berkeley.  He studies statistical natural language
processing, working with Professor Dan Klein.  He has held summer research
positions at the Information Sciences Institute at the University of
Southern California and with the translation group of Google Research.  He
plans to graduate in May 2010.

John specializes in large-scale statistical machine translation, an exciting
technology area at the nexus of artificial intelligence, distributed
computing, and computational linguistics.  His research focuses on
developing clean, model-based approaches to translation that can take
advantage of web-scale data sets.  He has also contributed to work on
parsing and unsupervised machine learning.  Please visit his website to
learn more:
http://www.eecs.berkeley.edu/~denero

Host:              Karen Livescu, klivescu at ttic.edu