[Colloquium] REMINDER: 4/29 Talks at TTIC: Mesrob Ohannessian, UC San Diego

Mary Marre mmarre at ttic.edu
Fri Apr 29 10:39:55 CDT 2016


When:     Friday, April 29th at 11:00 am

Where:    TTIC, 6045 S Kenwood Avenue, 5th Floor, Room 526

Who:      Mesrob Ohannessian, UC San Diego



Title: Handling and Harnessing Data Scarcity

Abstract: Two complementary problems arise in modern data analysis. On the
one hand, despite its abundance, data is often scattered and we need to
make inferences about rare and unseen events. On the other hand,
computational costs may force us to prune data before analysis, thus
imposing artificial data scarcity. How can we handle scarce data? And how
can we harness scarcity to reap computational savings?

A simple model of an intrinsically rare event is the set of all unseen
symbols in independent samples from a discrete distribution. Alan Turing
and Jack Good proposed a simple estimator for the probability of this
event. We first dismantle the impeccable image of this estimator by
showing that it can fail to be consistent in relative error, even for the
simplest light-tailed distributions. We then give a "no free lunch"
theorem: no estimator can be universally consistent in relative error. This
formalizes the intuition that, to make reasonable inferences about rare
events, further structure on the distribution family is necessary. In this light,
the old Good-Turing estimator acquires a new reputation, as a highly
effective specialized rare probability estimator for heavy-tailed
distributions. This explains its success in areas where such distributions
arise, such as in natural language modeling. Beyond Good-Turing, this gives
rise to a methodology that brings discrete rare event inference closer to
classical tail inference and opens the door to streamlined estimation
techniques that are inspired by extreme value theory.
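
For readers unfamiliar with the estimator named above: in its standard
form, the Good-Turing estimate of the probability of all unseen symbols
(the "missing mass") is the number of symbols observed exactly once divided
by the sample size. The following is a minimal Python sketch of that
formula only; the function name and the toy sample are illustrative
assumptions and are not taken from the talk.

```python
from collections import Counter


def good_turing_missing_mass(sample):
    """Estimate the total probability of symbols not seen in `sample`.

    Standard Good-Turing form: (number of symbols seen exactly once) / n.
    The function name is hypothetical, chosen for this illustration.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    # Count the symbols that appear exactly once in the sample.
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


# Toy sample (an assumption for illustration): 11 letters, of which
# 'c' and 'd' appear exactly once, so the estimate is 2/11.
estimate = good_turing_missing_mass(list("abracadabra"))
print(estimate)
```

The sketch computes the estimate itself; it says nothing about when that
estimate is consistent in relative error, which is the question the
abstract addresses.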

We then revisit one of the fundamental questions facing large-scale data
analysis: is it possible to trade off statistical risk against computational
time? We advance data scarcity as a mechanism to achieve such tradeoffs. To
show that the principle applies just as well to non-convex situations, we
concretize it for the probabilistic k-means problem, also known as vector
quantization. In particular, we show that by summarizing data, running time
can in fact decrease while data size increases, even as a desired risk is
maintained. We thus add data summarization to the list of methods, such as
stochastic optimization, that allow us to perceive data as a computational
resource rather than an impediment.
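
As a rough illustration of the general idea of clustering a summary
instead of the full data set, the sketch below runs plain Lloyd's k-means
on a small uniform subsample and then measures the risk on all of the
data. Uniform subsampling is a stand-in chosen here for simplicity; the
abstract does not say which summarization scheme the talk uses, and all
function names, sizes, and the synthetic one-dimensional data are
assumptions made for this example.

```python
import random


def lloyd_kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on 1-D points; returns k sorted centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize from the data itself
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            # Assign each point to its nearest current center.
            j = min(range(k), key=lambda i: (x - centers[i]) ** 2)
            clusters[j].append(x)
        # Move each center to its cluster mean; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)


def risk(points, centers):
    """Average squared distance from each point to its nearest center."""
    return sum(min((x - c) ** 2 for c in centers) for x in points) / len(points)


def summarize(points, m, seed=0):
    """Hypothetical summary: a uniform subsample of at most m points."""
    rng = random.Random(seed)
    return rng.sample(points, min(m, len(points)))


# Synthetic data (an assumption): two well-separated 1-D Gaussian clusters.
data_rng = random.Random(1)
data = ([data_rng.gauss(0.0, 1.0) for _ in range(5000)]
        + [data_rng.gauss(10.0, 1.0) for _ in range(5000)])

summary = summarize(data, 200)          # cluster 200 points instead of 10000
centers = lloyd_kmeans(summary, k=2)
print(centers, risk(data, centers))     # risk is evaluated on the full data
```

Each Lloyd iteration here costs time proportional to the summary size
rather than the full data size, which is the kind of saving the abstract
alludes to; the sketch does not reproduce the talk's risk guarantees.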


Bio: Mesrob I. Ohannessian is a postdoctoral researcher at UC San Diego.
During this time, he was a semester-long visitor at the Simons Institute at
UC Berkeley. Previously, he spent two years in France, one at the Microsoft
Research - Inria Joint Centre and another at Université Paris-Sud. He
received his PhD in EECS from MIT. His work has led to an AISTATS best
student paper award (2015), an ERCIM Alain Bensoussan / Marie Curie
Fellowship (2012), and a MERLOT educational material award (2007). His
research interests are in statistics, information theory, machine learning,
and their applications, particularly to problems marked by data scarcity.


Host: Nathan Srebro, nati at ttic.edu






Mary C. Marre
Administrative Assistant
Toyota Technological Institute
6045 S. Kenwood Avenue
Room 504
Chicago, IL  60637
p: (773) 834-1757
f: (773) 357-6970
mmarre at ttic.edu
