[Colloquium] Reminder - Phil Long talk 2:30 today

Wed May 21 09:11:53 CDT 2003

--------------------------------------------------------------------------------
TOYOTA TECHNOLOGICAL INSTITUTE 
--------------------------------------------------------------------------------

Date: Wednesday, May 21st, 2003

Time: 2:30 p.m.

Place: Ryerson Hall 251 

Speaker: Philip M. Long 
Genome Institute of Singapore

Title:  Boosting, Minimum Majority, and Microarray Data

Abstract: 
Microarrays allow scientists to test the level of expression of many genes in a 
given tissue sample. Recent research has shown that machine learning can be 
applied to learn rules to accurately predict properties of tissue samples based 
on these "expression profiles." 

Boosting algorithms for machine learning work by repeatedly running a 
subalgorithm on variants of a dataset in which the examples are reweighted to 
concentrate on the difficult cases. Boosting has been successfully applied in a 
variety of domains. However, previous work has suggested that it is not
well-suited to microarray data.
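
As a rough illustration of the reweighting idea described above, here is a
minimal sketch of standard AdaBoost with one-feature threshold rules
("decision stumps") as the subalgorithm. It is not AdaBoostVC, whose
modification is not described in this announcement. The function names
(best_stump, adaboost, predict) and the toy data are assumptions made up for
this example.

```python
import math


def stump_predict(row, feature, threshold, sign):
    """Predict +1/-1 by comparing one feature against a threshold."""
    return sign if row[feature] > threshold else -sign


def best_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) of the best stump."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for sign in (1, -1):
                err = sum(
                    wi
                    for row, yi, wi in zip(X, y, w)
                    if stump_predict(row, feature, threshold, sign) != yi
                )
                if best is None or err < best[0]:
                    best = (err, feature, threshold, sign)
    return best


def adaboost(X, y, rounds=10):
    """Run AdaBoost for a fixed number of rounds; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        err, feature, threshold, sign = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)  # vote of this stump
        ensemble.append((alpha, feature, threshold, sign))
        # Reweight: misclassified examples gain weight, correct ones lose it.
        w = [
            wi * math.exp(-alpha * yi * stump_predict(row, feature, threshold, sign))
            for row, yi, wi in zip(X, y, w)
        ]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Weighted majority vote of all stumps in the ensemble."""
    score = sum(
        alpha * stump_predict(row, feature, threshold, sign)
        for alpha, feature, threshold, sign in ensemble
    )
    return 1 if score >= 0 else -1


# Toy data (invented): one hypothetical "expression level" per sample.
# No single threshold separates the classes, so several rounds are needed.
X_toy = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y_toy = [1, 1, -1, -1, 1, 1]
```

Calling adaboost(X_toy, y_toy, rounds=40) and then predict(ensemble, row) on
each sample shows how a weighted vote of simple rules can fit data that no
single rule fits; each stump here interrogates only one feature, in the spirit
of rules that look at few genes.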

We have found one reason why AdaBoost, the standard boosting algorithm, does 
not perform well on microarray data, and identified a simple modification that 
significantly improves its accuracy. We call the modified algorithm AdaBoostVC. 
On benchmark data, AdaBoostVC is typically as accurate as, or a little more
accurate than, two algorithms using Support Vector Machines that we chose to be
representative of the state of the art. AdaBoostVC appears to be especially 
useful for finding accurate classification rules that interrogate the level of 
expression of few genes -- rules like this give scientists clues about which 
genes play key roles in biological processes associated with the tissue 
classes. 

In this talk, I will describe AdaBoostVC and the ideas behind it. I will also
describe another algorithm that was inspired by a theoretical analysis of 
boosting. This algorithm, the Minimum Majority Algorithm, is conceptually 
simpler than boosting, and is based on traditional, easily understood 
principles. It also outputs simpler class prediction rules. On most of the
benchmark datasets we tried, its rules are as accurate as those of boosting;
on a few, including microarray data, they are much worse.
Looking at when it does worse, including some experiments with artificial data, 
seems to shed additional light on why boosting works, and has inspired an 
alternative theoretical analysis of boosting. An aspect of this theoretical 
analysis in turn suggests one avenue for improving boosting algorithms; some 
preliminary experiments support its promise. To an extent, this study of 
boosting inspired the design of AdaBoostVC.

(This is joint work with Sanjoy Dasgupta and Vinsensius Vega.) 

Phil Long is the Senior Group Leader for Information and Mathematical Sciences 
at the Genome Institute of Singapore. He received a Ph.D. in Computer Science
from UC Santa Cruz in 1992, then held postdoctoral positions at the Graz
Technical University and Duke University from 1992 to 1996. He was on the
faculty of the Computer Science Department of the National University of
Singapore from 1996 to 2001,
before joining the Genome Institute of Singapore. He has published in STOC, 
FOCS, NIPS, AAAI, COLT, ICML and SODA. 

*Refreshments will be served after the talk in Ryerson 255*

If you wish to meet with the speaker, please send e-mail to Meridel Trimble 
mtrimble at uchicago.edu 

Toyota Technological Institute at Chicago