[Colloquium] REMINDER: Peter Chew Talk Today

Wed Oct 21 08:41:19 CDT 2009

DEPARTMENT OF COMPUTER SCIENCE

UNIVERSITY OF CHICAGO

Date: Wednesday, October 21, 2009
Time: 2:30 p.m.
Place: RY 255

----------------------------------------------------------

Speaker:	Peter Chew

From:		Sandia National Laboratories

Title: 	Information-theoretic improvements to multilingual document  
clustering

Abstract: Most approaches to textual information retrieval are based  
on the vector space model (VSM). VSM-based approaches are  
unsupervised; the only requirement is that the input be broken down  
into logical groupings in each of the modes of analysis. Typically,  
but not necessarily, analysis is in two modes (terms and documents),  
implying arrangement of the data into a matrix format, rows and  
columns representing terms and documents respectively. This  
representation opens up possibilities for manipulation of the data  
using a panoply of algebraic operations such as matrix multiplication,  
singular value decomposition, and, for analysis in three or more modes,  
tensor decompositions such as PARAFAC. With appropriate parallel  
corpora and algebraic methods, it is possible to tackle a particular  
problem of interest to us: clustering documents in multiple languages  
by topic, or multilingual document clustering (MDC) for short.
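
As a rough illustration of the two-mode arrangement described above, the
sketch below builds a small term-by-document count matrix in Python, with
rows for terms and columns for documents. The toy documents, the
whitespace tokenizer, and the function name build_term_document_matrix
are assumptions made for illustration only; none of them come from the
talk.

```python
def build_term_document_matrix(documents):
    """Return (terms, matrix), where matrix[i][j] counts term i in document j."""
    # Hypothetical tokenizer: lowercase, then split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # Rows are the sorted vocabulary; columns follow the input document order.
    terms = sorted({tok for doc in tokenized for tok in doc})
    index = {term: i for i, term in enumerate(terms)}
    matrix = [[0] * len(documents) for _ in terms]
    for j, doc in enumerate(tokenized):
        for tok in doc:
            matrix[index[tok]][j] += 1
    return terms, matrix


# Toy corpus, made up for this sketch.
docs = ["the cat sat", "the dog sat", "le chat"]
terms, matrix = build_term_document_matrix(docs)
```

A matrix in this form is what operations such as singular value
decomposition would then be applied to.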

We present empirical results from a number of experiments showing that  
insights from information theory can significantly improve MDC. These  
include an alternative to the popular log-entropy term weighting  
scheme, replacing the traditional term-by-document analysis with a  
term-by-term analysis, and using unsupervised morphological analysis  
(for example, from Linguistica) for feature extraction in the "term"  
mode(s). All these information-theoretic enhancements, being  
unsupervised like our VSM-based approaches to MDC, have the advantage  
that they can in theory be generalized to many different domains and  
languages. We also demonstrate our results impressionistically by  
presenting a visualization of a multilingual dataset, showing that  
translated documents in five languages do indeed cluster together  
across language boundaries, in addition to clustering in larger  
"topic" groups.
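
For readers unfamiliar with the log-entropy scheme mentioned above, the
following is a minimal sketch of its commonly cited form: a local weight
log(1 + count) multiplied by a global weight 1 + sum_j p_ij log(p_ij) /
log(n), where p_ij is the share of term i's occurrences that fall in
document j and n is the number of documents. This is only the baseline
scheme, not the alternative proposed in the talk; the function name and
the toy counts are assumptions.

```python
import math


def log_entropy_weights(matrix):
    """Weight a term-by-document count matrix (rows = terms, columns = documents).

    Assumes at least two documents, since the global weight divides by log(n_docs).
    """
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        total = sum(row)  # total occurrences of this term across all documents
        entropy_sum = 0.0
        for count in row:
            if count > 0:
                p = count / total
                entropy_sum += p * math.log(p)
        # Global weight: 1 for a term confined to one document,
        # 0 for a term spread evenly over every document.
        global_weight = 1.0 + entropy_sum / math.log(n_docs)
        # Local weight: log(1 + raw count).
        weighted.append([global_weight * math.log(1 + count) for count in row])
    return weighted


# Toy counts: the first term is spread evenly, the second occurs in one document only.
counts = [[1, 1, 1], [3, 0, 0]]
weights = log_entropy_weights(counts)
```

In this toy case the evenly spread term is weighted down to roughly zero,
while the term concentrated in a single document keeps its full local
weight.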

Please note that the talk is being held in RY 255.