[Colloquium] TTIC Talk: John Blitzer, UC Berkeley

Julia MacGlashan macglashan at tti-c.org
Tue Jan 12 11:30:41 CST 2010


*REMINDER*

When:             *Wednesday, Jan 13 @ 11:00am*, light lunch will follow

Where:            *TTI-C Conference Room #526*, 6045 S Kenwood Ave


Who:               *John Blitzer*, UC Berkeley


Title:            *Learning Correspondence Representations for Natural
Language Processing*



The key to creating scalable, robust natural language processing (NLP)
systems is to exploit correspondences between known and unknown linguistic
structure.  Natural language processing has experienced tremendous success
over the past two decades, but our most successful systems are still limited
to the domains and languages where we have large amounts of hand-annotated
data.  Unfortunately, these domains and languages represent a tiny portion
of the total linguistic data in the world.  No matter the task, we
encounter linguistic features, such as words and syntactic constituents,
that never appeared in the data used to estimate our models.  This talk is
about linking these linguistic features to one another through
correspondence representations.

The first part describes a technique to learn lexical correspondences for
domain adaptation of sentiment analysis systems.  These systems predict the
general attitude of an essay toward a particular topic.  In this case, words
which are highly predictive in one domain may not be present in another.  We
show how to build a correspondence representation between words in different
domains using projections to low-dimensional, real-valued spaces.  Unknown
words are projected onto this representation and related directly to known
features via Euclidean distance.  The correspondence representation allows
us to train significantly more robust models in new domains, and we achieve
a 40% relative reduction in error due to adaptation over a state-of-the-art
system.
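
To make the projection idea concrete, here is a minimal Python sketch.
It is not the system from the talk: the word list, the pivot features,
and the toy co-occurrence counts below are all invented for illustration.
It only shows the mechanics of embedding words in a shared low-dimensional
space and relating an unseen word to known ones by Euclidean distance.

import numpy as np

# Rows: words observed across the two domains; columns: counts of
# co-occurrence with three hypothetical "pivot" features that appear in
# both domains.  All numbers are made up for this sketch.
words = ["excellent", "riveting", "sturdy", "flimsy"]
cooc = np.array([
    [6.0, 1.0, 0.0],
    [1.0, 6.0, 0.0],
    [5.0, 2.0, 1.0],
    [0.0, 0.0, 6.0],
])

# Learn a low-dimensional, real-valued representation via SVD: the top-k
# right singular vectors define a projection of pivot-count vectors.
k = 2
_, _, Vt = np.linalg.svd(cooc, full_matrices=False)
proj = Vt[:k]                 # k x num_pivots projection matrix
embed = cooc @ proj.T         # each known word as a point in R^k

def nearest_known(unknown_counts):
    """Project an unseen word's pivot counts; return the closest known word."""
    v = unknown_counts @ proj.T
    dists = np.linalg.norm(embed - v, axis=1)   # Euclidean distance
    return words[int(np.argmin(dists))]

# An unseen new-domain word whose pivot profile matches "riveting":
print(nearest_known(np.array([1.0, 6.0, 0.0])))   # -> riveting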

The second part describes a technique to learn syntactic correspondences
between languages for machine translation.  Syntactic machine translation
models exploit syntactic correspondences to translate grammatical structures
(e.g. subjects, verbs, and objects) from one language to another.  Given
pairs of sentences which are translations of one another, we build a latent
correspondence grammar which links grammatical structures in one language to
grammatical structures in another.  The syntactic correspondences induced by
our grammar significantly improve a state-of-the-art Chinese-English machine
translation system.
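
Again purely as an illustrative sketch: the rules below are hand-written,
whereas the latent correspondence grammar in the talk is induced from
parallel sentence pairs.  The sketch only shows what a syntactic
correspondence is: a link between a constituent pattern in one language
and its reordered counterpart in the other.

# A tree is (label, children) where children is a list of subtrees,
# or (label, word) at the leaves.
def reorder(tree, rules):
    """Recursively rewrite a source tree into target constituent order."""
    label, children = tree
    if isinstance(children, str):      # leaf: keep the word (lexical
        return tree                    # choice is handled elsewhere)
    kids = [reorder(c, rules) for c in children]
    key = (label, tuple(c[0] for c in children))
    if key in rules:                   # a correspondence fires here
        kids = [kids[i] for i in rules[key]]
    return (label, kids)

# One toy correspondence: Chinese typically places a PP before the verb,
# so the children of an English V-PP verb phrase swap order.
RULES = {("VP", ("V", "PP")): (1, 0)}

english = ("S", [("NP", [("N", "John")]),
                 ("VP", [("V", "spoke"),
                         ("PP", [("P", "with"), ("N", "Mary")])])])

print(reorder(english, RULES))
# The PP now precedes the verb inside the VP, roughly
# "John with Mary spoke", matching Chinese constituent order.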

Bio:

John Blitzer is a postdoctoral fellow in the computer science department at
the University of California, Berkeley, working with Dan Klein.  He
completed his PhD in computer science at the University of Pennsylvania
under Fernando Pereira, and in 2008 spent 6 months as a visiting researcher
in the natural language computing group at Microsoft Research Asia.

John's research focuses on applications of machine learning to natural
language.  In particular, he is interested in exploiting unlabeled data and
other sources of side information to improve supervised models.  He has
applied these techniques to tagging, parsing, entity recognition, web
search, and machine translation.  To learn more about John's research
interests, please visit his web page: http://john.blitzer.com

Host:             Greg Shakhnarovich, gregory at ttic.edu