Colloquia: Talk by Junghoo Cho - Monday, March 12

Margery Ishmael marge at cs.uchicago.edu
Wed Mar 7 10:19:54 CST 2001


Monday, March 12 at 2:30 p.m. in Ryerson 251

Talk by Junghoo Cho, Stanford University

Crawling the Web: Discovery and Maintenance of Large-Scale Web Data

In this talk I will discuss the challenges
faced in implementing an effective Web crawler.
A crawler is a program that retrieves Web pages,
commonly for a Web search engine.
Often, a crawler has to download hundreds of millions of pages
in a short period of time and has to constantly monitor and
refresh the downloaded pages.
In addition, the crawler should avoid putting too much pressure
on the visited Web sites and the crawler's local network,
because they are intrinsically shared resources.
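As a rough illustration of the idea above (a sketch, not code from the talk), a minimal breadth-first crawler can enforce a per-host politeness delay so no single site or network link is overloaded. Here `fetch` and `extract_links` are hypothetical caller-supplied hooks, and the delay value is an arbitrary assumption:

```python
import time
from collections import deque
from urllib.parse import urlparse

def crawl(seeds, fetch, extract_links, max_pages=100, per_host_delay=1.0):
    """Breadth-first crawl with a per-host politeness delay.

    fetch(url) -> page text and extract_links(url, text) -> iterable of
    URLs are supplied by the caller (hypothetical hooks, not a real API).
    """
    frontier = deque(seeds)
    seen = set(seeds)
    last_hit = {}   # host -> time of last request, for politeness
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        host = urlparse(url).netloc
        # Wait if we contacted this host too recently.
        wait = per_host_delay - (time.time() - last_hit.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)
        last_hit[host] = time.time()
        pages[url] = fetch(url)
        for link in extract_links(url, pages[url]):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

A real crawler would also honor robots.txt and parallelize across many hosts at once; this sketch only shows the shared-resource constraint the abstract mentions.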

These requirements pose many interesting challenges
in the design and implementation of a Web crawler.
For example, how can we parallelize the crawling activity to
achieve maximal download rate with minimal overhead?
How should the crawler revisit pages to maintain the highest "freshness"
of pages? What pages should the crawler download
to improve the "quality" of the downloaded pages?
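As context for the freshness question, a standard back-of-the-envelope model (an assumption here, not necessarily the analysis presented in the talk) treats a page's changes as a Poisson process with rate lambda. If the crawler re-downloads the page every I time units, the expected fraction of time its copy is fresh is (1 - e^(-lambda*I)) / (lambda*I):

```python
import math

def expected_freshness(change_rate, revisit_interval):
    """Expected fraction of time a cached copy is fresh, assuming the
    page changes as a Poisson process with the given rate and is
    re-downloaded once every revisit_interval time units."""
    x = change_rate * revisit_interval
    # Average of exp(-change_rate * t) over one revisit interval.
    return (1.0 - math.exp(-x)) / x
```

Revisiting much faster than the page changes (small x) keeps freshness near 1, while long intervals let it decay, which is why revisit scheduling matters.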

In the talk I will first go over these challenges and present
some solutions that I have developed. In particular,
I will describe results from an experiment,
in which I monitored more than half a million pages for 4 months.
I will also present some theoretical results that
show how to design and operate a Web crawler.
http://www-db.stanford.edu/~cho

*This talk will be followed by refreshments in Ryerson 255*

Please send e-mail to marge at cs.uchicago.edu if you would like to meet 
the speaker.
-- 
Margery Ishmael
Department of Computer Science
The University of Chicago
1100 E. 58th Street
Chicago, IL. 60637

Tel. 773-834-8977  Fax. 773-702-8487


