[Colloquium] REMINDER: Peter Chew Talk Today
Katie Casey
caseyk at cs.uchicago.edu
Wed Oct 21 08:41:19 CDT 2009
DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF CHICAGO
Date: Wednesday, October 21, 2009
Time: 2:30 p.m.
Place: RY 255
----------------------------------------------------------
Speaker: Peter Chew
From: Sandia National Laboratories
Title: Information-theoretic improvements to multilingual document
clustering
Abstract: Most approaches to textual information retrieval are based
on the vector space model (VSM). VSM-based approaches are
unsupervised; the only requirement is that the input be broken down
into logical groupings in each of the modes of analysis. Typically,
but not necessarily, analysis is in two modes (terms and documents),
implying arrangement of the data into a matrix format, rows and
columns representing terms and documents respectively. This
representation opens up possibilities for manipulation of the data
using a panoply of algebraic operations such as matrix multiplication,
singular value decomposition, and, for analysis in three or more modes,
tensor decompositions such as PARAFAC. With appropriate parallel
corpora and algebraic methods, it is possible to tackle a particular
problem of interest to us, clustering documents in multiple languages
by topic, or multilingual document clustering (MDC) for short.
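As a rough illustration of the two-mode arrangement described above,
the sketch below builds a term-by-document count matrix and reduces it
with a truncated singular value decomposition. It is not the speaker's
implementation: the function names, the whitespace tokenization, and
the choice of raw counts are all assumptions made for this example.

```python
import numpy as np

def term_document_matrix(docs):
    """Return (vocab, X) where X[i, j] counts term i in document j.

    Rows are terms and columns are documents. Tokenization is a plain
    whitespace split, which is an assumption of this sketch.
    """
    vocab = sorted({t for d in docs for t in d.split()})
    index = {t: i for i, t in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for t in d.split():
            X[index[t], j] += 1.0
    return vocab, X

def document_vectors(X, k):
    """Project each document onto the top-k singular directions of X.

    Returns an array of shape (n_docs, k); row j is the reduced
    representation of document j.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (s[:k, None] * Vt[:k, :]).T
```

The rows returned by document_vectors can then be compared (for
example, by cosine similarity) or passed to any clustering routine.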
We present empirical results from a number of experiments showing that
insights from information theory can significantly improve MDC. These
include an alternative to the popular log-entropy term weighting
scheme, replacing the traditional term-by-document analysis with a
term-by-term analysis, and using unsupervised morphological analysis
(for example, from Linguistica) for feature extraction in the "term"
mode(s). All these information-theoretic enhancements, being
unsupervised like our VSM-based approaches to MDC, have the advantage
that they can in theory be generalized to many different domains and
languages. We also demonstrate our results impressionistically by
presenting a visualization of a multilingual dataset, showing that
translated documents in five languages do indeed cluster together
across language boundaries, in addition to clustering in larger
"topic" groups.
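For reference, the log-entropy weighting mentioned above can be
sketched in its commonly used form as follows. This is the baseline
scheme, not the alternative presented in the talk, and the function
name and the base-2 logarithm for the local weight are assumptions of
this example.

```python
import numpy as np

def log_entropy(X):
    """Apply log-entropy weighting to a term-by-document count matrix.

    Local weight:  log2(1 + f_ij)
    Global weight: g_i = 1 + sum_j p_ij * ln(p_ij) / ln(n_docs),
                   with p_ij = f_ij / (total count of term i).
    """
    X = np.asarray(X, dtype=float)
    n_docs = X.shape[1]
    if n_docs < 2:
        raise ValueError("log-entropy weighting needs at least two documents")
    totals = X.sum(axis=1, keepdims=True)
    # Terms that never occur keep p = 0 everywhere (no division by zero).
    p = np.divide(X, totals, out=np.zeros_like(X), where=totals > 0)
    plogp = np.zeros_like(p)
    mask = p > 0
    plogp[mask] = p[mask] * np.log(p[mask])
    g = 1.0 + plogp.sum(axis=1) / np.log(n_docs)
    return g[:, None] * np.log2(1.0 + X)
```

A term spread evenly over all documents receives a global weight of 0,
while a term confined to a single document receives a global weight
of 1.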
Please note that the talk is being held in RY 255.