[Colloquium] Talk by Nathan Srebro, TTI-C on November 3, 2008

Wed Oct 22 13:51:15 CDT 2008

DEPARTMENT OF COMPUTER SCIENCE

UNIVERSITY OF CHICAGO

Date: Monday, November 3, 2008
Time: 3:45 p.m.
Place: RY 251

----------------------------------------------------------

Speaker:	Nathan Srebro

From:		TTI-Chicago

Web page:	http://ttic.uchicago.edu/~nati/

Title: On the Informational Cost of Tractability

Abstract: As with many other machine learning problems, most  
formulations of clustering correspond to non-convex, NP-hard,  
optimization problems. Yet, when the data really is clustered, and  
enough data is available, local search and other heuristics tend to be  
able to recover the clustering. At an extreme regime, we can even  
prove the clustering is tractably recoverable. On the other extreme,  
if the data is not well clustered, or not enough data is available,  
the "optimal" clustering, although hard to find, is meaningless. This  
leads to the common wisdom that "clustering isn't hard: its either  
easy, or not interesting". Is it in-fact the case that when a  
meaningful clustering is statistically recoverable, it is also easy to  
find computationally? If so, we certainly do not have a rigorous  
understanding of why this is the case. Or, is there a regime in which  
the clustering is statistically recoverable, but not tractably  
recoverable? E.g., might we need more samples in order to ensure that  
our recovery methods work, beyond what is statistically necessary if  
we had infinite computational power? Can we quantify this increase in  
the required sample size due to our computational limitations? In this  
talk I will discuss the above issues, mostly in the context of  
clustering data from a mixture of Gaussians, but also in some other  
machine learning problems. I will also mention other ways in which  
increasing the size of the data reduces, rather than increases, the  
amount of required computation, contrary to the standard approach of  
studying the complexity of computation as increasing with the size of  
the data.