[Colloquium] Seminar Announcement: Why Infinite Exchangeable Mixture Models Fail for Sparse Data Sets Yet Microclustering Succeeds

Thu Mar 10 08:44:29 CST 2016

~Reminder~

Computation Institute Presentation - Data Lunch Seminar (DLS)

Speaker: Rebecca C. Steorts, Duke University, Assistant Professor, Department of Statistical Science; affiliated faculty in Computer Science, Biostatistics, the Social Science Research Institute, and the Information Initiative at Duke
Host:  Kyle Chard
Date:  March 11, 2016
Time: 12:00 PM - 1:00 PM 
Location: The University of Chicago, Searle 240A, 5735 S. Ellis Ave.

Title:  Why Infinite Exchangeable Mixture Models Fail for Sparse Data Sets Yet Microclustering Succeeds

Abstract: 
Record linkage merges large, potentially noisy databases to remove duplicate entities. Community detection is the process of placing entities into similar partitions, or "communities." Both tasks are important in applications such as author disambiguation, genetics, official statistics, human rights conflict, and others. It is common to treat record linkage and community detection as clustering tasks. However, most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman--Yor process mixture models make this assumption. For some tasks this assumption is inappropriate: when performing record linkage, for example, the size of each cluster is often unrelated to the size of the data set. Consequently, each cluster contains a negligible fraction of the total number of data points. Such tasks require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the "microclustering property" and discussing a new model that exhibits this property. We illustrate this on real and simulated data.
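
As a rough illustration of the contrast drawn in the abstract (this is not the speaker's model), the sketch below samples a partition from a Chinese restaurant process, the partition induced by a Dirichlet process mixture, and compares the share of the data held by its largest cluster with that of a toy partition whose clusters never exceed three records. The function names, the concentration parameter alpha=1.0, and the size cap of three are assumptions made only for this example.

```python
import random


def crp_cluster_sizes(n, alpha=1.0, seed=0):
    """Sample cluster sizes for n points from a Chinese restaurant process."""
    rng = random.Random(seed)
    sizes = []
    for i in range(n):
        # Point i opens a new cluster with probability alpha / (i + alpha);
        # otherwise it joins an existing cluster with probability
        # proportional to that cluster's current size.
        u = rng.random() * (i + alpha)
        if u < alpha:
            sizes.append(1)
            continue
        u -= alpha
        for k, size in enumerate(sizes):
            if u < size:
                sizes[k] += 1
                break
            u -= size
        else:
            # Guard against floating-point round-off at the upper boundary.
            sizes[-1] += 1
    return sizes


def bounded_cluster_sizes(n, max_size=3):
    """Toy partition (hypothetical): clusters hold at most max_size records."""
    full, rest = divmod(n, max_size)
    return [max_size] * full + ([rest] if rest else [])


def largest_fraction(sizes):
    """Share of all data points that fall in the largest cluster."""
    return max(sizes) / sum(sizes)


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        crp = largest_fraction(crp_cluster_sizes(n))
        bounded = largest_fraction(bounded_cluster_sizes(n))
        print(f"N={n:6d}  CRP largest share={crp:.3f}  bounded share={bounded:.5f}")
```

Under the Chinese restaurant process the largest cluster typically keeps a non-negligible share of the data as N grows, whereas the bounded partition's largest share is at most 3/N and so shrinks toward zero, which is the behavior record linkage calls for.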

Paper: http://arxiv.org/abs/1512.00792

Bio:  
Rebecca C. Steorts received her B.S. in Mathematics in 2005 from Davidson College, her M.S. in Mathematical Sciences in 2007 from Clemson University, and her Ph.D. in 2012 from the Department of Statistics at the University of Florida under the supervision of Malay Ghosh. She was a Visiting Assistant Professor from 2012 to 2015, during which she worked closely with Stephen E. Fienberg. She is currently an Assistant Professor in the Department of Statistical Science at Duke University. Rebecca was named to MIT Technology Review's 35 Innovators Under 35 for 2015 as a humanitarian in the field of software. Her work was profiled in the September/October issue of MIT Technology Review, and she was recognized at a special ceremony and gave an invited talk at EmTech in November 2015. In addition, Rebecca is a recipient of the Metaknowledge Network Templeton Foundation Grant, a National Science Foundation (NSF) SES grant, the University of Florida (UF) Graduate Alumni Fellowship Award, the U.S. Census Bureau Dissertation Fellowship Award, and support from the UF Innovation through Institutional Integration Program (I-Cubed) and NSF for the development of an introductory Bayesian course for undergraduates. Rebecca was awarded Honorable Mention (second place) for the 2012 Leonard J. Savage Thesis Award in Applied Methodology. Her research interests are in large-scale clustering and record linkage for computational social science applications.

Homepage: http://www2.stat.duke.edu/~rcs46/

Information:  Lunch will be provided
