[Colloquium] Stephen Fitz Dissertation Defense/Oct 9, 2023

meganwoodward at uchicago.edu
Tue Oct 3 11:20:45 CDT 2023


This is an announcement of Stephen Fitz's Dissertation Defense.
===============================================
Candidate: Stephen Fitz

Date: Monday, October 09, 2023

Time: 9 am CDT

Remote Location: https://uchicago.zoom.us/j/96439516683?pwd=YXVXYlhLV2puR3ZrRGNvTkZtekR0dz09


Title: The Shape of Words - topological structure in language representations

Abstract: "The Shape of Words" is a study of linguistic data and its representations from a topological perspective.

Over the past century, new branches of mathematics have emerged that explore the properties of high-dimensional objects. These ideas came initially from the Erlangen Program outlined by Felix Klein in his seminal work on the formalization of geometry as the study of invariants under algebraically defined groups of transformations. The body of research produced through this program led to the development of category theory as a unifying language connecting previously isolated branches of mathematics. A particularly powerful type of such category-theoretic relationship is expressed by the homotopy and homology functors, which link the realms of topology and abstract algebra. These notions belong to the branch of mathematics known as algebraic topology, which has developed at an accelerating pace over the past century, producing a set of powerful tools for probing global and local properties of manifolds. In recent years, advances in hardware, the availability of large data sets, and innovation in architectural and algorithmic design have enabled the successful application of methods from manifold theory to real-world data. Following this general trend, I develop a computational apparatus, based on ideas from algebraic topology, to gain deeper insight into linguistic structures. In "The Shape of Words" I present a collection of research endeavors that view natural language through a topological lens.

In part 1, I develop a representation of raw text data as a topological object (which I call "the word manifold"), whose topology encodes the n-gram patterns of words in context. The word manifold comes with a natural geometric realization, and can be used to produce vector space representations of words and n-gram contexts by training a neural autoencoder.

In part 2, I present the main contribution: a striking discovery about the topological structure of language models. I define a metric of topological complexity called perforation, designed for visualizing changes in the representation manifolds of linguistic data induced by neural language models during training. Experiments with natural and synthetic corpora lead to a discovery about the relationship between global representations of linguistic units at the input embedding layer of a gated recurrent language model and the corresponding deep contextualized representations of the hidden state.

Finally, in part 3, I explore applications of the topological methods developed in the first two parts to a variety of problems in NLP. I show that topological document representation leads to significant model compression without sacrificing performance on the downstream task of multi-document question answering. I also apply the algorithms developed here to a new analysis of the Voynich manuscript, a centuries-old linguistic enigma that remains unsolved. Lastly, I explore topological aspects of sentence representations arising in large transformer-based language models trained on trillions of tokens of natural language text. I develop a novel approach to visualizing moral dimensions in LLMs and discuss applications to AI safety.

Advisors: John Goldsmith and Janos Simon

Committee Members: John Goldsmith, Risi Kondor, and Janos Simon


