[Colloquium] Tomorrow: Liu/Dissertation Defense/6-20-06

Margaret Jaffey margaret at cs.uchicago.edu
Mon Jun 19 09:19:42 CDT 2006


This is a reminder of Jing Liu's dissertation defense tomorrow.

-----
	Department of Computer Science/The University of Chicago

				*** Dissertation Defense ***


Candidate: Jing Liu

Date:  Tuesday, June 20, 2006

Time and Location:  10:00 a.m. in Ryerson 251

Title:  Two Perspectives on Biological Sequence Analysis

Abstract:
In my thesis, a pair of reading glasses with two different lenses
(filters), physical chemistry and computational linguistics, is used
to "read" the large warehouse of biological sequence data.

We examine the PDBSELECT dataset, a non-redundant subset of PDB,
and analyze the 3-D structure of the protein sequences with
a physical chemistry lens. The role of pair-wise interactions
between neighboring amino acids in protein sequences is examined
by studying residue pairs whose sidechains are closely aligned
in the sense that their initial (CA--CB) segments are nearly parallel.
This small but significant fraction of residue pairs tends to be highly
polar in composition, including many like-charged pairs.
In addition, residue pairs with such closely aligned sidechains appear
overwhelmingly in loops or at boundaries between different secondary
structures. We examine the conformations of two different like-charged
pairs in detail and show that each pair displays similar characteristic
structural correlations which are different from what is found for the
same pairs when their sidechains are not closely aligned.
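
As a rough illustration of the "nearly parallel" criterion above, the
following minimal sketch computes the angle between the CA--CB vectors
of two residues. The function names and the 20-degree cutoff are
assumptions made for this example only; the abstract does not state the
threshold or any additional distance criteria used in the thesis.

```python
import math

def sub(a, b):
    """Component-wise difference a - b of two 3-D points."""
    return tuple(x - y for x, y in zip(a, b))

def angle_deg(u, v):
    """Angle in degrees between vectors u and v."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        raise ValueError("zero-length vector")
    # Clamp to [-1, 1] to guard against floating point round-off.
    c = max(-1.0, min(1.0, dot / (nu * nv)))
    return math.degrees(math.acos(c))

def sidechains_aligned(ca1, cb1, ca2, cb2, max_angle_deg=20.0):
    """True if the two CA->CB vectors are nearly parallel.

    max_angle_deg is a hypothetical cutoff, not a value from the thesis.
    """
    return angle_deg(sub(cb1, ca1), sub(cb2, ca2)) <= max_angle_deg
```

Coordinates would typically come from the CA and CB atom records of a
PDB entry; glycine, which has no CB atom, would need separate handling.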

Biological sequences are viewed as an extension of natural language
with a fixed alphabet and modular architecture, and are analyzed with a
computational linguistics lens. We describe Baum-Welch and Viterbi
training algorithms that can automatically extract context features of
biological sequences, using unsupervised language acquisition
techniques developed in computational linguistics. The key new element
we borrow from unsupervised language acquisition is the concept of a
lexicon, a list of the building blocks, or words, of a language. There
is a distinct lexicon for each data set, determined automatically by
our algorithms.
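
To make the idea of Viterbi training over a lexicon concrete, here is a
minimal sketch under strong simplifying assumptions: a flat unigram
model in which each lexicon word has a single log-probability, and
sequences are split into the most probable string of words. This is not
the revised de Marcken model from the thesis; the function names and
the model are illustrative only.

```python
import math
from collections import Counter

def viterbi_segment(text, logp, max_len):
    """Most probable segmentation of text into lexicon words.

    logp maps each word to its log-probability; max_len is the
    length of the longest word to consider.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in logp and best[start] + logp[word] > best[end]:
                best[end] = best[start] + logp[word]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this lexicon")
    words = []
    end = n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]

def viterbi_training_step(sequences, logp):
    """One hard-EM step: segment, then re-estimate word probabilities.

    Words that never appear in a best segmentation are dropped, so a
    practical version would keep single characters in the lexicon.
    """
    max_len = max(len(w) for w in logp)
    counts = Counter()
    for seq in sequences:
        counts.update(viterbi_segment(seq, logp, max_len))
    total = sum(counts.values())
    return {w: math.log(c / total) for w, c in counts.items()}
```

Baum-Welch training would differ in that it accumulates expected word
counts over all segmentations rather than counting only the single
best one.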

We revise de Marcken's concatenative model and implement the Baum-Welch
and Viterbi training algorithms to process a large warehouse of
biological sequences. When tested on an English corpus, the novel The
Adventures of Tom Sawyer, consisting of 390,487 characters, our Viterbi
training algorithm shows performance comparable to, and efficiency
higher than, de Marcken's model. We applied the Viterbi training
algorithm to the PIR-NREF database of protein sequences, containing
2.38 x 10^8 characters. The resulting protein lexicon was composed of
over two million words, with a maximum word length of 360 characters.
The computational experiment required on the order of 10^12 floating
point operations. Our algorithm is efficient in handling large
databases. Finally, we apply the Viterbi training algorithm to a
search-by-content eukaryotic promoter identification algorithm,
PromoterScreen. In our experiments on a human dataset, three lexicons
are trained from the promoter, intron, and CDS training datasets of the
public Representative Benchmark Data Sets of Human DNA Sequences
provided by the Berkeley Drosophila Genome Project web site. When the
trained PromoterScreen model is applied to the review dataset of
Fickett and Hatzigeorgiou, PromoterScreen is shown to be comparable to
published programs for identifying eukaryotic promoter regions.

We also review techniques for defining metrics on spaces of measures,
considering both discrete and continuous probability spaces. As an
example of the use of a metric on discrete probability distributions,
we measure and compare the distances between lexicons drawn from the
English language over a historical period, using several different
metrics. These metrics can be used as a quantitative measure of
lexicon distance in our future work.
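
The abstract does not name the metrics used. As a hedged example of
what a distance between two lexicons could look like, the sketch below
treats each lexicon as a discrete probability distribution over words
and computes two standard metrics, total variation and Hellinger
distance. The choice of these two metrics and all function names are
assumptions for illustration.

```python
import math

def normalize(counts):
    """Turn a word -> count mapping into a probability distribution."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def hellinger(p, q):
    """Hellinger distance between discrete distributions p and q."""
    keys = set(p) | set(q)
    s = sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
            for k in keys)
    return math.sqrt(0.5 * s)
```

Both distances lie in [0, 1], equal 0 for identical distributions, and
equal 1 when the two lexicons share no words.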

Candidate's Advisor: Prof. L. Ridgway Scott

A draft copy of Ms. Liu's dissertation is available in Ry 161A.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Margaret P. Jaffey                             margaret at cs.uchicago.edu
Department of Computer Science
Student Support Rep (Ry 161A)        (773) 702-6011
The University of Chicago                  http://www.cs.uchicago.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=



