ColloquiaTalk: Mark Manasse, Friday, February 14

Margery Ishmael marge at cs.uchicago.edu
Wed Feb 12 11:48:12 CST 2003


----------------------------------------------------------------------------

DEPARTMENT OF COMPUTER SCIENCE - TALK

Friday, February 14, 2003 at 2:30 pm in Ryerson 251

----------------------------------------------------------------------------

Speaker: MARK MANASSE
From:  Microsoft Research - Silicon Valley
http://www.std.org/~msm/

Title: "Approximating the Jaccard metric efficiently:
An introduction to shingleprinting and its applications"

Abstract:
It is often of interest, given a large collection of things, to quickly
determine if many of the things are effectively the same. Consider, for
example, the problem of identifying music from a Napster-like service:
it may be of interest to RIAA to ascertain which of the songs are copies
of songs in their catalog. It might be of use to a search engine for
the web to identify near-duplicate pages, to reduce the clutter on a
results page, or to reduce the number of pages in the full-text index.
It might be of interest to an intelligence agency when scanning a crowd
at a public event, when trying to identify which members of the crowd
are likely to be suspected malefactors.

Note that in all of these, we are interested not in bit-for-bit identity
(which is easy to determine using hashing), but in effective identity,
despite variations in the source due to alternative ripping or
compression, minor editing, or differences due to aging or imprecise
measurements. This leaves us with the problem of finding collections of
highly-similar things in a large collection, or of identifying the most
similar things in a large collection to a given thing.

In general, we assume that there is a good algorithm for determining
whether two things are effectively the same. We will assume that this
algorithm works by feature-extraction on each of the two things, and
then by comparing features.

Having extracted features (and hashing the features), our things can be
reduced to sets of bounded-range integers. The Jaccard metric defines
the similarity of two sets as the cardinality of their intersection
divided by the cardinality of their union. In shingleprinting, we
exhibit a technique for approximating the Jaccard metric by randomly
sampling features from documents, in such a way that we can efficiently
find all pairs of highly-similar things, even from a collection of
billions of things.

We will also look at some recent results where, using shingleprinting to
condense our data storage needs, we looked at the evolution of web pages
on a week-to-week basis.

Hosts: Anne Rogers and Janos Simon


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Margery Ishmael
Secretary to the Chairman, Department of Computer Science
The University of Chicago
1100 E. 58th Street, Chicago, IL. 60637-1581
tel. 773.834.8977 fax. 773.702.8487
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-




More information about the Colloquium mailing list