[CS] Announcement for Sam Zhang Candidacy Exam

n-yack at uchicago.edu
Wed Jun 9 10:30:30 CDT 2021


This is an announcement of Sam Zhang's Candidacy Exam.
===============================================
Date: June 10, 2021

Time: 11:00 am

Location: Zoom link: https://uchicago.zoom.us/j/98594404983?pwd=NUJhVGFCSVAwMDkrbnZueTIwSUZPdz09
Password: 248569

Candidacy Exam: Sam Zhang

Title: Scheduling Challenges and Exploration for Data Centers with Variable Resource Capacity

Abstract:

Electrical power consumption growth has accelerated due to both the end of Dennard scaling and the continued rapid growth in computing demand, for both performance and capacity. These limits make dynamic power management for cost, cooling, sharing, or stabilizing the power grid a source of variable capacity for data centers. At another level, carbon emission management can give rise to dynamic capacity. Aggressive carbon footprint reduction pledges, compounded by the rapid growth of hyperscale cloud operators (e.g., Amazon, Microsoft, Google), require data centers to reduce power, perhaps on a dynamic basis in concert with the use of renewable generation. These varied scenarios suggest that clusters, scheduling domains, and even entire data centers will have dynamic power constraints, effectively changing their capacity for computing. Today’s resource management systems and schedulers generally assume full knowledge of resource capacity and presume that it will remain stable. It is not known how to achieve high useful throughput in the face of continual resource capacity variations.

Based on emerging examples, we aim to understand scheduling performance under resource variability and to provide effective resource utilization despite sudden, perhaps large, changes in the available resources. We define the variable resource capacity problem and the key dimensions of resource capacity variation, and give specific examples that arise from the natural world (carbon content, power price, data center cooling, and more). Key dimensions of resource capacity variation include dynamic range, frequency, and structure. With these dimensions, an empirical trace can be characterized, abstracting it from the many possible important real-world generators of variation. Resource capacity variation can arise from many causes, including weather, market prices, renewable energy, carbon emission targets, and internal dynamic power management constraints. We give examples of these sources of variable capacity. We assess performance on HPC and cloud workloads (Microsoft, Google), with both dedicated and oversubscription scheduling models, exploring the dimensions of variation. Extensive evaluation of scheduling in a single data center, varying workloads and scheduling models, shows that even with the same total quantity of available resources, capacity variation can incur significant performance degradation and Service-Level Objective (SLO) violations, with goodput reductions of 10-60% and up to 50% SLO violations.

To mitigate performance degradation and reduce SLO violations, we consider foresight, scheduler improvements, and headroom capacity. These techniques tackle the resource variability challenge from different angles: external factor prediction, scheduling policies, and additional hardware support. For a single data center, we find that foresight of only a few hours restores 80% to 100% of goodput. Further, heuristics that selectively terminate or slow down jobs based on job attributes and current progress produce significant improvement: a 44% goodput increase on average and near-complete elimination of SLO violations. With realistic examples, scheduling techniques also demonstrate benefits of up to 15% carbon emission reduction and 14% power cost savings by exploiting resource capacity variations. We will continue to study performance under resource variability across multiple data centers, compared with fixed-capacity data centers, using realistic workloads. Further, we will explore improvement techniques such as load shifting policies and prediction. We expect that, by enabling intelligent workload shifting, data centers can provide robust performance under resource capacity variation and exploit variation-related benefits. Finally, we describe the research plan and the full thesis outline in the proposal. The complete experimental results will be presented in the final thesis.

Advisors: Andrew Chien

Committee Members: Andrew Chien, Hank Hoffmann, and Sanjay Krishnan



