[Colloquium] Daniar Kurniawan Dissertation Defense/Apr 10, 2024

Megan Woodward meganwoodward at uchicago.edu
Tue Apr 9 09:48:10 CDT 2024


This is an announcement of Daniar Kurniawan's Dissertation Defense.
===============================================
Candidate: Daniar Kurniawan

Date: Wednesday, April 10, 2024

Time: 10:30 am CT

Remote Location: https://uchicago.zoom.us/j/8085185734?pwd=RDdqRnhjZHlkMkFnY29Lc3VSbEhEQT09

Location: JCL 390

Title: Storage Support for Scaling Embedding Table in Recommendation Systems Deployment

Abstract: With recommendation systems playing an increasingly pivotal role in guiding user decisions online, their impact on user engagement cannot be overstated. Recent studies show that a significant share of consumed content comes from suggestions made by recommendation algorithms: 30% of all traffic on Amazon’s website, 60% of the videos on YouTube, and 75% of the movies viewed on Netflix. As users become more reliant on these systems, they expect higher-quality recommendations tailored to their individual preferences. To meet this demand, recommendation systems must encode richer semantic relationships, which requires larger embedding vector tables (EV tables). This has led to EV table sizes tripling every two years (1.5× annual growth). Consequently, managing the space consumed by EV tables becomes challenging: many real-world EV tables contain billions of embedding vectors that require tens of TBs of memory capacity. Such DRAM-heavy architectures account for significant operational costs, measured in millions of dollars, for users of deep recommendation systems (DRSs); nearly 80% of all AI-related deployment in Meta’s data centers in 2020 directly supported DRSs. A natural solution to this problem is to move the large EV tables to backend storage (SSDs or HDDs). However, doing so can introduce performance instability, particularly for large-scale DRSs tasked with meeting stringent microsecond-scale tail-latency Service Level Objectives (SLOs). The current trend toward microservices and machine learning deployment further exacerbates these challenges, with tail-latency SLOs expected to become even tighter over time.

The storage industry and research community have dedicated significant effort to addressing unpredictable latency in SSDs. Various approaches, spanning white-box, gray-box, and black-box techniques, have been proposed to mitigate this challenge. While each approach offers unique advantages and trade-offs, how the storage stack should evolve to accommodate the evolving needs of recommendation systems remains a critical open question. In this dissertation, we seek to answer that question: How should the storage stack evolve to meet the growing demands for low and predictable latency and cost-efficiency in the context of the burgeoning usage of recommendation systems? To answer it, we develop a set of solutions, each addressing a distinct facet of improving storage support for recommendation systems.

We start developing our solutions at the inference layer of the deep recommendation system, where the performance variance originates. This layer is critical because it directly impacts the user experience by determining the relevance and accuracy of the recommendations provided. By optimizing this foundational layer, we aim to enhance the overall performance and efficiency of the recommendation system. To realize this goal, we built EVSTORE, a caching system that exploits groupability patterns among items to make recommendation system deployments scalable and cost-efficient. EVSTORE’s main contribution lies in its three-layer “L1-to-L3” caching design (EVCache, EVMix, and EVProx). We have fully integrated EVSTORE within Facebook (Meta)’s DLRM, including various implementation-level optimizations and offline supporting tools (≈9 KLOC) that are released publicly. Our evaluation on real production DRS traces shows that EVSTORE reduces average and p90 latency by up to 23% and 27%, respectively, while increasing throughput by 4× at only a 0.2% accuracy reduction. Collectively, a fully optimized EVSTORE implementation achieves a 94% reduction in the DRS memory footprint; these memory savings correspond to hundreds of millions of dollars for a large cloud provider.
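To make the layered lookup concrete, the following minimal Python sketch shows how a three-level EV lookup of this general shape could be organized: exact full-precision vectors first, then reduced-precision copies, then an approximate vector borrowed from a grouped ("similar enough") item, with the on-flash EV table as the last resort. The class and method names, the exact semantics of each layer, and the SSD interface (ssd_table.read) are illustrative assumptions, not EVSTORE's actual code or API.

    import numpy as np

    class EvLookup:
        # Hypothetical three-level embedding-vector lookup, loosely modeled on an
        # L1-to-L3 cache hierarchy; names and layer semantics are assumptions.
        def __init__(self, l1_cache, l2_cache, l3_proxy_map, ssd_table):
            self.l1 = l1_cache       # item id -> full-precision vector (DRAM)
            self.l2 = l2_cache       # item id -> reduced-precision vector (DRAM)
            self.l3 = l3_proxy_map   # item id -> id of a "similar enough" cached item
            self.ssd = ssd_table     # backing EV table on flash (hypothetical .read)

        def get(self, item_id):
            vec = self.l1.get(item_id)
            if vec is not None:                       # L1 hit: exact, full precision
                return vec
            vec = self.l2.get(item_id)
            if vec is not None:                       # L2 hit: exact, reduced precision
                return np.asarray(vec, dtype=np.float32)
            proxy_id = self.l3.get(item_id)
            if proxy_id is not None:                  # L3 hit: reuse a grouped item's vector
                proxy = self.l1.get(proxy_id)
                if proxy is None:
                    proxy = self.l2.get(proxy_id)
                if proxy is not None:
                    return np.asarray(proxy, dtype=np.float32)
            return self.ssd.read(item_id)             # full miss: pay the SSD read

The point of such a hierarchy is that each lower layer trades a little precision or exactness for much larger effective coverage, so the expensive SSD read is reached only on a full miss.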

Although EVSTORE’s caching layers facilitate a performant and cost-efficient deployment of recommendation systems, its reliance on SSDs as backend storage can introduce performance instability due to unpredictable internal SSD activities such as garbage collection (GC), wear leveling, and buffer flushes. These background activities can disrupt user I/O requests and lead to tail latencies; GC alone can cause up to a 60× increase in latency, easily violating SLOs. To tackle this challenge, we turn to disciplined data science to create HEIMDALL, an efficient and accurate ML-based I/O admission method. HEIMDALL demonstrates significant advancements, improving our ML-for-storage case study by 40% in tail-latency reduction and 2.3× in inference throughput, while achieving up to 99% accuracy (95% on average) across a large set of benchmarks. We have developed HEIMDALL in 20.9 KLOC and seamlessly integrated it into Ceph and the Linux kernel with only 0.05μs of inference overhead per I/O.
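To illustrate the general pattern of per-I/O admission, here is a minimal Python sketch: a lightweight model scores whether a read sent to a busy SSD is likely to be slow (for example, stuck behind GC), and a likely-slow read is redirected to a replica instead of waiting. The features, the logistic-regression stand-in, the threshold, and the failover policy are assumptions for illustration only; they do not describe HEIMDALL's actual model or its Ceph/kernel integration.

    import numpy as np

    class IoAdmission:
        # Hypothetical per-I/O admission: admit reads predicted fast, reroute the rest.
        def __init__(self, weights, bias, threshold=0.5):
            self.w = np.asarray(weights, dtype=np.float32)   # trained offline
            self.b = float(bias)
            self.threshold = threshold

        def predict_slow(self, features):
            # A logistic regression stands in for whatever lightweight model is used;
            # per-I/O inference must cost far less than the latency it tries to avoid.
            z = float(np.dot(self.w, np.asarray(features, dtype=np.float32)) + self.b)
            return 1.0 / (1.0 + np.exp(-z))

        def route_read(self, primary, replicas, block, queue_features):
            if self.predict_slow(queue_features) < self.threshold:
                return primary.read(block)       # admit: the primary looks fast
            if replicas:
                return replicas[0].read(block)   # reject: fail over to a replica
            return primary.read(block)           # no alternative: wait out the slowdown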

With HEIMDALL handling tail latency, we target further improvements in SSD performance stability by reducing read latency. Our solution, TINFETCH, presents a novel approach to block prefetching that employs accurate stream disentanglement and suffix prediction with the help of a pretrained neural network. TINFETCH consists of two cohesive modules: an adaptive (heuristic) module, responsible for disentangling random accesses and fine-tuning prefetching aggressiveness, and a predictive (learning-based) module, designed to forecast future block accesses. Our evaluation, conducted on real workload traces from Microsoft, Tencent, and Alibaba, demonstrates that TINFETCH outperforms other prefetching solutions, achieving the highest hit rate and prefetching accuracy of up to 81% and 87%, respectively. Furthermore, TINFETCH exhibits the highest hit rate per unit of storage bandwidth load, surpassing the best state-of-the-art learning-based and heuristic-based prefetchers by 2.6× and 1.5×, respectively. Additionally, TINFETCH’s optimized C/C++ implementation keeps overhead low, enabling sub-μs inference latency on a consumer-level CPU with a memory footprint of only 25KB. These characteristics allow TINFETCH to be integrated into low-level storage systems with negligible overhead.
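The two-module split can be sketched in a few lines of Python: an adaptive heuristic filters out random accesses and tunes how many blocks to prefetch, while a predictive model proposes the likely continuation ("suffix") of a recognized stream. The stream-detection rule, the adaptation policy, and the model interface (predict_suffix) are assumptions made for illustration; TINFETCH's actual C/C++ design may differ.

    from collections import deque

    class Prefetcher:
        # Hypothetical two-module prefetcher: adaptive filtering plus learned suffix prediction.
        def __init__(self, model, max_degree=8):
            self.model = model               # pretrained suffix predictor (stand-in interface)
            self.recent = deque(maxlen=64)   # sliding window of recent block numbers
            self.degree = 2                  # current prefetch aggressiveness
            self.max_degree = max_degree

        def _looks_sequential(self, block):
            # Crude disentanglement: treat an access as part of a stream if it lands
            # near something seen recently; real prefetchers keep per-stream state.
            return any(abs(block - b) <= 4 for b in self.recent)

        def on_access(self, block, last_prefetch_hit):
            # Adapt aggressiveness to how useful recent prefetches were.
            if last_prefetch_hit:
                self.degree = min(self.max_degree, self.degree + 1)
            else:
                self.degree = max(1, self.degree - 1)
            is_stream = self._looks_sequential(block)   # classify before recording the access
            self.recent.append(block)
            if not is_stream:
                return []                    # random access: do not pollute the cache
            # Predictive module: ask the model for the likely suffix of this stream.
            candidates = self.model.predict_suffix(list(self.recent), self.degree)
            return candidates[: self.degree]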

Advisor: Haryadi Gunawi

Committee Members: Haryadi Gunawi, Ymir Vigfusson, and Hank Hoffmann






