[Colloquium] Reminder: Skluzacek/MS Presentation/Nov 19, 2018

Margaret Jaffey via Colloquium colloquium at mailman.cs.uchicago.edu
Fri Nov 16 09:21:31 CST 2018


This is a reminder about Tyler Skluzacek's MS Presentation on Monday.

------------------------------------------------------------------------------
Date:  Monday, November 19, 2018

Time:  3:00 PM

Place:  John Crerar Library 298

M.S. Candidate:  Tyler Skluzacek

M.S. Paper Title: Automated Workflows for Deriving and Extracting
Metadata from Disorganized Data Swamps

Abstract:
To mitigate the effects of high-velocity data expansion and to
automate the organization of filesystems and data repositories, we
have developed Skluma—a system that automatically processes a target
filesystem or repository, extracts content- and context-based
metadata, and organizes extracted metadata for subsequent use. Skluma
is able to extract diverse metadata, including aggregate values
derived from embedded structured data; named entities and latent
topics buried within free-text documents; and content encoded in
images. Skluma implements an overarching probabilistic pipeline to
extract increasingly specific metadata from files. It applies machine
learning methods to determine file types, dynamically prioritizes and
then executes a suite of metadata extractors, and explores contextual
metadata based on relationships among files. The derived metadata,
represented in JSON, describes probabilistic knowledge of each file
that may be subsequently used for discovery or organization. Skluma’s
architecture enables it to be deployed both locally and as an
on-demand, cloud-hosted service to create and execute dynamic
extraction workflows on massive numbers of files. It is modular and
extensible—allowing users to contribute their own specialized metadata
extractors. Thus far we have tested Skluma on local filesystems,
remote FTP-accessible servers, and publicly accessible Globus
endpoints. We have demonstrated its efficacy by applying it to a
scientific environmental data repository of more than 500,000 files.
We show that we can extract metadata from those files in a few hours
at modest cloud cost. Additionally, we include an extensive
overview of the literature, elucidating Skluma’s role in the data lakes
research ecosystem.

Tyler's advisor is Prof. Ian Foster.

Log in to the Computer Science Department website for details:
 https://newtraell.cs.uchicago.edu/phd/ms_announcements#skluzacek

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Margaret P. Jaffey            margaret at cs.uchicago.edu
Department of Computer Science
Student Support Rep (Ry 156)               (773) 702-6011
The University of Chicago      http://www.cs.uchicago.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

