Colloquia Talk by Eleazar Eskin, Columbia University - Monday, April 15

Wed Apr 3 12:10:08 CST 2002

Monday, April 15, 2002
2:30 pm
Ryerson 251

Eleazar Eskin, Columbia University
"Sparse Sequence Modeling Applied to Computational Biology and Intrusion 
Detection."

Abstract:
Sequence models have been studied for some time in different contexts
including language parsing and analysis, genomics, and more recently
computer security, in the area of intrusion detection.  Many of these
sequences can be characterized as "sparse"; that is, only a fraction of
the elements of the sequence have meaningful value.  This is the case
in many practical applications, such as the analysis of DNA sequences,
where it is postulated that only about 1-3% of the sequence has any
biological significance.  Similarly, in intrusion detection, the
evidence that an audit stream from a system contains an attack is
often buried in a vast amount of irrelevant information.  Furthermore,
modeling sparse sequences often requires allowing "softer" matches
between a sequence and a canonical model, for instance by tolerating
mismatches.  For example, the classical DNA signal "TATAAT" can often
occur with several mismatches at any position.  Computationally, this
is problematic because an exponential number of models can match a
given sequence.  Thus naive approaches to sparse sequence
modeling are computationally complex in both time and space.
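
To make the mismatch example concrete, here is a minimal Python sketch.
It is not the speaker's framework: the function names, the "*" wildcard
convention, and the test sequence are assumptions made only for
illustration.  It scans a sequence for windows within a given number of
mismatches of "TATAAT", and it enumerates the wildcard patterns that a
single window of length k can match, of which there are 2^k.

```python
from itertools import combinations


def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def find_soft_matches(sequence, motif, max_mismatches):
    """Return (position, window) pairs within max_mismatches of motif."""
    k = len(motif)
    return [
        (i, sequence[i:i + k])
        for i in range(len(sequence) - k + 1)
        if hamming(sequence[i:i + k], motif) <= max_mismatches
    ]


def wildcard_models(window, wildcard="*"):
    """Enumerate every pattern formed by masking a subset of positions.

    Each pattern is a hypothetical "sparse model" that the window
    matches; a window of length k yields 2**k of them.
    """
    k = len(window)
    models = []
    for r in range(k + 1):
        for masked in combinations(range(k), r):
            models.append(
                "".join(wildcard if i in masked else c
                        for i, c in enumerate(window))
            )
    return models


if __name__ == "__main__":
    # Made-up sequence containing one exact and one inexact occurrence.
    print(find_soft_matches("GGTATAATCCTATGATGG", "TATAAT", 1))
    print(len(wildcard_models("TATAAT")))
```

Even for this six-letter signal a single window matches 64 wildcard
patterns, which is why enumerating the models naively for every window
of a long sequence quickly becomes expensive.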

We present a new efficient framework for approaching sparse sequence
modeling problems.  Specifically, we demonstrate this framework on
three problems: classification of amino acid sequences into protein
families, outlier detection over sequences of system calls for
intrusion detection, and signal finding for discovering transcription
factor binding sites in gene promoter regions.  This framework employs
efficient data structures which index the sequences to allow iterating
over all possible sparse models of the sequence.  We modify several
learning algorithms, including Boosting, Support Vector Machines, and
a new set of outlier detection algorithms, to take advantage of this
iteration through possible models in our framework.  While still
considering as rich a set of models as the naive approaches, we
avoid their intractable time and space requirements.

Bio: I am a Ph.D. student in the Computer Science Department of Columbia 
University in New York City. My main research area is Machine Learning 
applied to Intrusion Detection and Computational Biology. I am a member of 
the Data Mining Lab and the Computational Biology group. Before I joined 
the Data Mining Lab, I was a member of the Natural Language 
Processing group. http://www.cs.columbia.edu/~eeskin/

Host: Ridgway Scott

*The talk will be followed by refreshments in Ryerson 255*
Persons with disabilities who may need assistance should call 773.834.8977

If you wish to meet with the speaker, please send an e-mail to Donna 
Brooms (donna at cs.uchicago.edu).
