ColloquiaTalk by John Goldsmith - Friday, February 7, 2003

Tue Jan 28 13:29:50 CST 2003

--------------------------------------------------------------------------------

DEPARTMENT OF COMPUTER SCIENCE - TALK

Friday, February 7, 2003 at 2:30 pm in Ryerson 251

--------------------------------------------------------------------------------

Speaker: JOHN A. GOLDSMITH, Professor

From: Department of Linguistics, University of Chicago
http://humanities.uchicago.edu/faculty/goldsmith

Title: "Learning linguistic structure"

Abstract:
The complexity of word structure in natural languages (known as morphology
in the linguistic literature) varies considerably across languages. The
range runs from:
(i) a small number of languages, such as Chinese, in which words have little
internal structure, to
(ii) those like English, where relatively complex words may be formed (see
(a)) but most words have relatively simple structure (see (b)),
(a) over-look-ed, trans-form-ation-al,
(b) form-ed, word-s, relative-ly, simple, most
to
(iii) those in which most words have significant internal structure, and verbs
may average more than 3 components (called morphemes) per word (e.g., Swahili,
Finnish, Turkish).

I will report on the development of an algorithm for discovering morphological
structure automatically from a text of 10,000 to 1,000,000 running words,
with no prior knowledge of the language. It is based on an application of
Minimum Description Length (MDL, Rissanen 1989) modeling: the algorithm
develops a probabilistic morphology of the data, and the length of this
analysis (in information-theoretic terms) is added to the compressed length
of the corpus (again, in information-theoretic terms) given the morphology; we
adopt the analysis for which this sum is a minimum.
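
As a rough illustration of the sum being minimized, here is a toy sketch in
Python. It is not the coding scheme used in the talk or the paper: the
27-symbol letter code, the stem-times-suffix probability model, the candidate
suffix list, and all function names are assumptions made only for this
example. It compares a "whole word" analysis with a "stem + suffix" analysis
of a tiny corpus.

```python
import math
from collections import Counter

# Assumption: each letter (26 letters plus an end-of-entry marker) costs
# the same number of bits when the lexicon is spelled out.
BITS_PER_LETTER = math.log2(27)


def model_cost(entries):
    """Bits needed to list every distinct entry, letter by letter."""
    return sum((len(e) + 1) * BITS_PER_LETTER for e in entries)


def data_cost(items):
    """Bits needed to encode a sequence using its empirical frequencies."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())


def description_length(corpus, split):
    """Model length plus compressed corpus length for one analysis.

    `split` maps a word to a (stem, suffix) pair; the suffix may be "".
    A word's probability is taken to be P(stem) * P(suffix).
    """
    pairs = [split(w) for w in corpus]
    stems = [s for s, _ in pairs]
    suffixes = [x for _, x in pairs]
    model = model_cost(set(stems)) + model_cost(set(suffixes))
    data = data_cost(stems) + data_cost(suffixes)
    return model + data


def split_whole(word):
    """Analysis with no morphology: every word is an unanalysed stem."""
    return (word, "")


def split_suffix(word, candidates=("ing", "ed", "s")):
    """Analysis with a hand-supplied (hypothetical) suffix list."""
    for suf in candidates:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return (word[: -len(suf)], suf)
    return (word, "")


corpus = [
    "walk", "walks", "walked", "walking",
    "jump", "jumps", "jumped", "jumping",
    "talk", "talks", "talked", "talking",
]

print("whole words  :", round(description_length(corpus, split_whole), 1), "bits")
print("stem + suffix:", round(description_length(corpus, split_suffix), 1), "bits")
```

On this corpus the stem + suffix analysis has the shorter total, mostly
because its lexicon is much smaller, so under MDL it would be the analysis
adopted. Note that the sketch is handed its candidate suffixes; finding such
candidates is the job of the heuristics discussed next.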

MDL provides a fine theoretical framework for evaluating models, but it must
be augmented with specialized knowledge (heuristics) to know where to look for
reasonable morphological models in a reasonable amount of time, given the
data. Poorly designed heuristics can lead to computation requiring hours (or
tens of hours, or more) on a corpus of 100,000 words; well-designed heuristics
may run in as little as 10 seconds.

The current results for languages of group (ii) above, like English, are very
good, and I will review the heuristics and the results. I will also sketch two
areas of current work. (1) With Misha Belkin, I am using graph-theoretic
methods to obtain information about how words with specific suffixes are
distributed syntactically. This permits us to deduce, for example, with no
prior knowledge of English structure, that the -s that appears in boy-s is not
the same as the -s in sing-s. (2) We are currently extending these methods to
languages with rich morphologies (group (iii) above), using string edit
distance between pairs of words to bootstrap the discovery of morphemes.
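
For readers unfamiliar with string edit distance, a minimal Python sketch
follows. It computes the standard Levenshtein distance (insertions,
deletions, and substitutions each cost 1) and lists word pairs that lie
within a small distance of each other. How the distance is actually used for
bootstrapping is not specified here, so the threshold and the helper
`close_pairs` are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between strings a and b, by dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def close_pairs(words, max_distance=2):
    """Hypothetical helper: word pairs whose distance is at most max_distance."""
    return [
        (u, v, edit_distance(u, v))
        for u, v in combinations(words, 2)
        if edit_distance(u, v) <= max_distance
    ]


# Swahili present-tense forms of "read" that differ only in the subject prefix.
words = ["ninasoma", "unasoma", "anasoma"]
for u, v, d in close_pairs(words):
    print(u, v, d)
```

Pairs at a small distance, such as the three forms above, share most of
their letters; the positions where they differ are natural places to look
for morpheme boundaries.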

Some of this work is discussed in: "Unsupervised learning of the morphology
of a natural language," Computational Linguistics 27:2, pp. 153-198 (2001).

*The talk will be followed by refreshments in Ryerson 255*

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Margery Ishmael
Secretary to the Chairman, Department of Computer Science
The University of Chicago
1100 E. 58th Street, Chicago, IL. 60637-1581
tel. 773.834.8977 fax. 773.702.8487
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-