[Colloquium] Sam Zhang Dissertation Defense/Dec 9, 2022

Megan Woodward meganwoodward at uchicago.edu
Thu Dec 8 08:24:14 CST 2022


This is an announcement of Sam Zhang's Dissertation Defense.
===============================================
Candidate: Sam Zhang

Date: Friday, December 09, 2022

Time: 12 pm CST

Location: JCL 390

Title: Eliminating the Capacity Variation Penalty for Cloud Resource Management

Abstract: Growing power grid challenges from rapid decarbonization, together with pressure to reduce carbon emissions and power cost, compel data centers to operate with capacity that varies over periods of hours or days, perhaps dynamically in concert with the use of renewable generation. With data centers exceeding 10% of load in many grids, the required capacity variation may approach 50%. For today’s computing, variable resource capacity is problematic, causing severe loss in throughput and corresponding resource efficiency.
	Our approach is to create intelligent resource management for variable capacity resources. Traditional resource managers were built on the assumption of constant capacity, so when capacity decreases, the jobs they have scheduled fail abruptly and resources are wasted. To understand scheduling performance under variable capacity, we define the key dimensions of variation that lead to performance loss. Using both cloud and HPC production workloads, we explore the multi-dimensional capacity change space, characterizing scheduler performance in resource efficiency, job failures, and waiting time. Moreover, to improve performance, we consider two dimensions of uncertainty, in capacity and in workload, exploring the corresponding information space. We systematically consider scenarios that reduce uncertainty, and propose new scheduling techniques that exploit the information to minimize job failures and increase resource efficiency. These scheduling algorithms use this information to cope with capacity decreases and plan for capacity increases.
	We evaluated traditional schedulers under varying capacity using a diverse set of workloads, including one HPC and three cloud workloads. Results show that capacity variation can decrease goodput by up to 60%, incurring 15-40% job failures. Among variability dimensions, the results show that dynamic range, structure, and change frequency are all important, each in some cases producing 10-40% goodput losses. A drill-down with Google cloud workloads shows that variable capacity can cause serious problems, including up to 70% goodput loss, 20% job failures, and a 15x increase in job wait time. Careful study of performance versus variability shows that avoiding major harm requires limiting variation to <10% dynamic range. This limit prevents cloud data centers from shifting significant load to reduce carbon emissions or power cost.
	We designed new scheduling schemes that seek to tolerate variation and compared their performance across a variety of workloads and variation traces. These new schedulers exploit a variety of potential information about workload and capacity variation to reduce uncertainty. Our experimental results demonstrate that these new scheduling algorithms achieve significant performance improvements under resource variability, with up to 3x goodput increase, 10x job failure reduction, and 5x job waiting time reduction. Heuristics to select which jobs to terminate are particularly important. Using job attributes and progress to minimize wasted computation produces a 44% goodput increase on average and nearly eliminates job failures. Within the information space, runtime classification is critical information for improving scheduling performance under variable capacity. Exploiting this information, the LongShort algorithm can drastically improve the ability to support variation in capacity, from <10% to 50%, while maintaining performance. Realistic examples show that with these scheduling techniques, a typical data center can achieve benefits of up to 15% carbon emission reduction and 14% power cost savings by exploiting resource capacity variations.
	While capacity variation poses serious challenges for conventional resource managers, our intelligent resource management shows significant improvement, eliminating the variation penalty and demonstrating the promising benefits of future variable capacity data centers.

Advisor: Andrew Chien

Committee Members: Andrew Chien, Hank Hoffmann, Sanjay Krishnan, and Varun Gupta

