[Colloquium] Reminder - Phil Long talk 2:30 today
Meridel Trimble
mtrimble at tti-c.org
Wed May 21 09:11:53 CDT 2003
--------------------------------------------------------------------------------
TOYOTA TECHNOLOGICAL INSTITUTE
-------------------------------------------------------------------------------
Date: Wednesday, May 21st, 2003
Time: 2:30 p.m.
Place: Ryerson Hall 251
Speaker: Philip M. Long
Genome Institute of Singapore
Title: Boosting, Minimum Majority, and Microarray Data
Abstract:
Microarrays allow scientists to test the level of expression of many genes in a
given tissue sample. Recent research has shown that machine learning can be
applied to learn rules to accurately predict properties of tissue samples based
on these "expression profiles."
Boosting algorithms for machine learning work by repeatedly running a
subalgorithm on variants of a dataset in which the examples are reweighted to
concentrate on the difficult cases. Boosting has been successfully applied in a
variety of domains. However, previous work has suggested that it is not
well-suited to microarray data.
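As an illustration of the reweighting idea described above, here is a minimal
sketch of standard AdaBoost in Python. It is not AdaBoostVC, whose modification
is not spelled out in this announcement. The choice of one-feature threshold
rules ("decision stumps") as the subalgorithm, the function names, and the toy
dataset are all assumptions made for this example.

```python
import math

def train_stump(X, y, w):
    """Return (error, feature, threshold, sign) for the threshold rule
    with the lowest weighted error on the current example weights."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for sign in (1, -1):
                # Weighted error: total weight of the misclassified examples.
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if (sign if xi[j] > thr else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, x):
    _, j, thr, sign = stump
    return sign if x[j] > thr else -sign

def adaboost(X, y, rounds=10):
    """Standard AdaBoost with labels in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        # Clamp the error so the logarithm below stays finite.
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Reweight: misclassified examples gain weight, correct ones lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote of the stumps."""
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1

# Toy data (made up): no single threshold separates the two classes.
X = [[0], [1], [2], [3], [4], [5]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(X, y, rounds=3)
predictions = [predict(ensemble, x) for x in X]
print(predictions)
```

On this toy data a single stump misclassifies at least two points, while the
weighted vote of three stumps fits all six, because each round shifts weight
onto the examples the previous rules got wrong.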
We have found one reason why AdaBoost, the standard boosting algorithm, does
not perform well on microarray data, and identified a simple modification that
significantly improves its accuracy. We call the modified algorithm AdaBoostVC.
On benchmark data, AdaBoostVC is typically as accurate as, or a little more
accurate than, two algorithms using Support Vector Machines that we chose to be
representative of the state of the art. AdaBoostVC appears to be especially
useful for finding accurate classification rules that interrogate the level of
expression of few genes -- rules like this give scientists clues about which
genes play key roles in biological processes associated with the tissue
classes.
In this talk, I will describe AdaBoostVC and the ideas behind it. I will also
describe another algorithm that was inspired by a theoretical analysis of
boosting. This algorithm, the Minimum Majority Algorithm, is conceptually
simpler than boosting, and is based on traditional, easily understood
principles. It also outputs simpler class prediction rules. While the accuracy
of the rules it outputs is as good as that of boosting on most of the benchmark
datasets we tried, on a few, including microarray data, it is much worse.
Looking at when it does worse, including some experiments with artificial data,
seems to shed additional light on why boosting works, and has inspired an
alternative theoretical analysis of boosting. An aspect of this theoretical
analysis in turn suggests one avenue for improving boosting algorithms; some
preliminary experiments support its promise. To an extent, this study of
boosting inspired the design of AdaBoostVC.
(This is joint work with Sanjoy Dasgupta and Vinsensius Vega.)
Phil Long is the Senior Group Leader for Information and Mathematical Sciences
at the Genome Institute of Singapore. He received a Ph.D. in Computer Science
from UC Santa Cruz in 1992, then held postdoctoral positions at the Graz
Technical University and Duke University from 1992 to 1996. He was on the
faculty of the Computer Science Department of the National University of
Singapore from 1996 to 2001,
before joining the Genome Institute of Singapore. He has published in STOC,
FOCS, NIPS, AAAI, COLT, ICML and SODA.
*Refreshments will be served after the talk in Ryerson 255*
If you wish to meet with the speaker, please send e-mail to Meridel Trimble
mtrimble at uchicago.edu
Toyota Technological Institute at Chicago