This is an announcement of Tyler Skluzacek's Candidacy Exam.
Candidate: Tyler Skluzacek

Date: Monday, August 23, 2021

Time: 1 pm CDT

Title: Can automated metadata extraction make distributed data swamps more navigable?

Abstract: Scientific data repositories are generally chaotic: files spanning heterogeneous domains, studies, and users are stuffed into an increasingly unsearchable data swamp without regard for organization, discoverability, or usability. Files that could contribute to a scientist’s future research may be spread across multiple storage facilities and submerged beneath petabytes of other files, rendering manual annotation and navigation virtually impossible. To remedy this lack of navigability, scientists require a rich search index of metadata, or data about data, extracted from individual files. In this thesis, I will explore automated solutions for converting dark data swamps into navigable data collections, given no prior knowledge of each file’s schema or provenance. I first explore ways to extract metadata from files of vastly different structures by building a robust suite of metadata extraction functions capable of processing a wide array of file types. To increase extraction efficiency, I then explore automated file type identification methods so that each file is processed only by the extraction functions applicable to it.

Advisor: Ian Foster

Committee Members: Raul Castro Fernandez, Kyle Chard, Michael Franklin, and Ian Foster

