[Colloquium] Zhang/Dissertation Defense/Apr 3, 2019

Margaret Jaffey margaret at cs.uchicago.edu
Wed Mar 20 09:28:44 CDT 2019



       Department of Computer Science/The University of Chicago

                     *** Dissertation Defense ***


Candidate:  Huazhe Zhang

Date:  Wednesday, April 3, 2019

Time:  10:00 AM

Place:  John Crerar Library (JCL) 298

Title: Maximizing Performance in Power Constrained Computing Systems

Abstract:
Power constraints have arguably become the biggest obstacle to scaling
computing system performance. From single processors to large-scale
systems such as supercomputers and datacenters, all are power
restricted in one way or another to ensure normal operation. Different
computing systems may require different power management systems, but
the goal is invariant: to guarantee a certain power budget, or cap.
While enforcing the power cap is mandatory, it is meaningless without
delivering performance; a trivial way to enforce a power cap without
considering performance is simply to shut down the system. Thus, the
real challenge can be stated as: given a power consumption constraint,
how can the performance of a computing system be optimized? This is an
important yet broad question. This dissertation focuses on solving it
for server systems, from single-node scale to large scale. More
specifically, it includes three major parts on maximizing server
system performance under a power cap.

First, we propose PUPiL, a power control system that addresses the
power challenge at the single-node level. At the node level,
processors are constrained by dark silicon: their abundance of
transistors enables them to draw more power than they can safely
sustain. These physical constraints create a need for power control
systems that guarantee the processor operates within a strict power
cap. We make two key observations about the tradeoffs between existing
software-based and hardware-based power control approaches: (1)
hardware techniques provide significantly faster response, quickly
enforcing power limits; (2) software provides much greater
flexibility, tailoring resource usage to the current application
workload, which leads to high performance efficiency. With these
insights, we formulate and evaluate the hybrid hardware/software power
capping system PUPiL. It takes advantage of both worlds, the fast
response of hardware and the high efficiency of software, to deliver
relatively high performance with decent response time. PUPiL is
evaluated against an advanced software approach and a state-of-the-art
hardware approach. Across a number of power targets and workloads, we
find that PUPiL achieves nearly the same response time as RAPL, the
hardware approach, and significantly higher performance. On average,
PUPiL outperforms the hardware approach by 1.18-2.4x, depending on
workload and power target. PUPiL offers a promising way to enforce
power caps with greater performance than the state-of-the-art
hardware-only approach.
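
The hybrid idea in the PUPiL paragraph above can be illustrated with a
toy Python sketch. It is not PUPiL's actual algorithm: the power
model, the Amdahl-style performance model, and every name below
(node_power, hardware_cap, performance, hybrid_search) are hypothetical
stand-ins chosen for illustration. The simulated "hardware" only
throttles frequency to meet the cap, while the "software" layer tries
different core counts and keeps the one that performs best under that
cap.

```python
def node_power(cores, freq_ghz):
    """Hypothetical power model: 10 W idle + 2 W per core per GHz."""
    return 10 + 2 * cores * freq_ghz


def hardware_cap(cores, freq_ghz, cap_watts):
    """Toy hardware capper: lower frequency until power fits the cap.

    Frequency is never lowered below 0.5 GHz in this toy model.
    """
    while freq_ghz > 0.5 and node_power(cores, freq_ghz) > cap_watts:
        freq_ghz = round(freq_ghz - 0.1, 1)
    return freq_ghz


def performance(cores, freq_ghz, parallel_fraction):
    """Toy performance model: Amdahl speedup scaled by frequency."""
    serial = 1 - parallel_fraction
    speedup = 1.0 / (serial + parallel_fraction / cores)
    return speedup * freq_ghz


def hybrid_search(cap_watts, parallel_fraction,
                  max_cores=16, max_freq=3.0):
    """Software layer: try each core count, let the toy hardware
    enforce the cap, and keep the best-performing configuration."""
    best = None
    for cores in range(1, max_cores + 1):
        freq = hardware_cap(cores, max_freq, cap_watts)
        perf = performance(cores, freq, parallel_fraction)
        if best is None or perf > best[2]:
            best = (cores, freq, perf)
    return best


# Compare the hybrid choice with a hardware-only baseline that always
# uses all cores and relies on frequency throttling alone.
cores, freq, perf = hybrid_search(cap_watts=50, parallel_fraction=0.5)
baseline = performance(16, hardware_cap(16, 3.0, 50), 0.5)
print(cores, freq, perf, baseline)
```

In this toy model a poorly scaling workload is better served by fewer
cores at a higher frequency than by all cores throttled down, which is
the kind of flexibility the abstract attributes to software control.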

Second, we propose PowerShift, a distributed power management system
that addresses the emerging challenge of power capping coupled
applications in large-scale systems. While support for system-wide
power caps has been widely studied, an emerging workload, coupled
applications, brings new opportunities as well as challenges. Coupled
applications execute concurrently instead of being processed serially
as before. Such coupling saves I/O and time, as the applications now
communicate at runtime instead of through disk. Coupled applications
are predicted to be a major workload for future exascale
supercomputers; e.g., scientific simulations will execute concurrently
with in situ analysis. On the other hand, the performance behavior of
coupled applications is fundamentally different from that of
independent applications: the speed of the couple is determined by the
speed of the slowest application. Thus, optimizing coupled application
performance under a power cap requires slowing down the faster
application to shift more power to the slower one. Furthermore, due to
interaction within the couple, it is sometimes not possible to
increase performance through any amount of power shifting. In that
case, the couple scheduler should reduce total power usage to save
energy. We address the unique challenges of coupled applications with
PowerShift, a family of three techniques for shifting power between
dependent applications in a distributed system. Compared to SLURM, a
state-of-the-art job scheduler, PowerShift increases mean performance
by 7-14%. Besides improving performance, PowerShift also recognizes
when it is not possible to increase performance and instead reduces
energy, achieving an 18% energy reduction for a 5% performance loss.
Finally, the dynamic techniques of PowerShift are resilient to tail
behavior and system noise, improving performance in noisy environments
by 30-36%. To our knowledge, PowerShift is the first work to address
the challenge of coupled workloads and demonstrate improved
performance, reduced energy, and dynamic adjustment to tail behavior
and system noise.
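
The observation that a couple runs at the speed of its slowest member
can be made concrete with a small Python sketch. This is only an
illustration under stated assumptions, not one of PowerShift's three
techniques: the linear speed functions, the greedy one-watt step, and
the names couple_speed and shift_power are all hypothetical.

```python
def couple_speed(speed_a, speed_b):
    """A coupled pair progresses at the speed of its slower member."""
    return min(speed_a, speed_b)


def shift_power(perf_a, perf_b, total_cap, step=1.0, min_power=10.0):
    """Greedily move power from the faster to the slower application.

    perf_a and perf_b map a power allocation (watts) to a speed.
    The search starts from an even split and stops as soon as one more
    step would not improve the couple's speed.
    """
    power_a = total_cap / 2
    power_b = total_cap - power_a
    best = couple_speed(perf_a(power_a), perf_b(power_b))
    while True:
        if perf_a(power_a) > perf_b(power_b):
            # A is faster: take power from A and give it to B.
            new_a, new_b = power_a - step, power_b + step
        else:
            # B is faster (or tied): take power from B, give it to A.
            new_a, new_b = power_a + step, power_b - step
        if min(new_a, new_b) < min_power:
            break
        new_speed = couple_speed(perf_a(new_a), perf_b(new_b))
        if new_speed <= best:
            break
        power_a, power_b, best = new_a, new_b, new_speed
    return power_a, power_b, best


# Hypothetical linear speed models for a slow simulation and a fast
# in situ analysis sharing a 200 W budget.
def simulation(watts):
    return 0.02 * watts


def analysis(watts):
    return 0.05 * watts


power_a, power_b, speed = shift_power(simulation, analysis, 200.0)
print(power_a, power_b, speed)
```

The sketch covers only the performance case. The energy-saving case
described in the abstract, where no shift helps and total power should
be reduced instead, is not modeled here.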

Third, we propose PowerShift++, a hierarchical distributed power
control system inspired by both PUPiL and PowerShift, to better
overcome the challenge of dependent applications. While PowerShift
addresses the unique challenge of coupled applications, two major
limitations greatly reduce its practicality and performance
efficiency. First, all distributed power management systems rely on
some node-level power capping technique, and PowerShift depends on
RAPL, a hardware-only power capping approach, to enforce node-level
power consumption. As the PUPiL study shows, however, neither a
hardware-only nor a software-only approach is good enough for
node-level power capping; a hybrid approach can combine the best of
both worlds. Second, PowerShift requires prior knowledge of the
application profiles to derive the optimal power distribution between
coupled applications. As offline profiles are not always available in
the real world, this greatly limits its practicality. PowerShift++
addresses these two limitations by coordinating an original hybrid
node-level power capping technique with system-level power shifting
and by building the power-performance model at runtime. More
specifically, PowerShift++ deploys a learning/hardware hybrid
node-level power capping scheme to optimize the resource allocation at
each node. The learning part uses a classifier to predict the optimal
resource allocation (socket allocation, use of hyperthreading, and
memory allocation) for the current workload at runtime, based on
collected low-level hardware counter information, and leaves DVFS to
Intel RAPL hardware capping. This is mainly inspired by PUPiL but is
more portable because it requires no code instrumentation of
applications. To free PowerShift++ from depending on offline profiles,
it fits the power-performance model online, based on power and
performance observed at runtime.

The three power management frameworks systematically study how to
maximize performance in power-constrained server systems, from
single-node scale to large-scale systems. The key ideas and insights
are general enough to guide the design of real-world power control
systems for a wide range of workloads and platforms. The implemented
systems are open source and have been evaluated to be practical,
scalable, and reliable, and are not limited to particular applications
or systems; we hope they will serve as a base model/system for future
research on power capping.
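
As a rough illustration of the online model fitting mentioned for
PowerShift++ above, the Python sketch below fits a straight line
relating power to performance from samples as they arrive. The
abstract does not say what model form PowerShift++ uses; the linear
form, the class name OnlineLinearModel, and the sample values are
assumptions made purely for illustration.

```python
class OnlineLinearModel:
    """Incremental least-squares fit of perf = intercept + slope * power.

    Only running sums are stored, so each new runtime sample updates
    the model in constant time with no offline profile.
    """

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xx = 0.0
        self.sum_xy = 0.0

    def observe(self, power, perf):
        """Record one (power, performance) sample measured at runtime."""
        self.n += 1
        self.sum_x += power
        self.sum_y += perf
        self.sum_xx += power * power
        self.sum_xy += power * perf

    def coefficients(self):
        """Return (intercept, slope), or None if the fit is undefined."""
        denom = self.n * self.sum_xx - self.sum_x ** 2
        if self.n < 2 or denom == 0:
            return None
        slope = (self.n * self.sum_xy - self.sum_x * self.sum_y) / denom
        intercept = (self.sum_y - slope * self.sum_x) / self.n
        return intercept, slope

    def predict(self, power):
        """Predict performance at a power level, or None if unfitted."""
        coeffs = self.coefficients()
        if coeffs is None:
            return None
        intercept, slope = coeffs
        return intercept + slope * power


# Hypothetical samples: (power in watts, measured performance).
model = OnlineLinearModel()
for watts, perf in [(50, 1.5), (60, 1.7), (70, 1.9), (80, 2.1)]:
    model.observe(watts, perf)
print(model.coefficients(), model.predict(100))
```

A model of this kind could then feed a power-shifting decision without
any offline profile, which is the dependency the abstract says
PowerShift++ removes.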

Huazhe's advisor is Prof. Henry Hoffmann.

Log in to the Computer Science Department website for details,
including a draft copy of the dissertation:

 https://newtraell.cs.uchicago.edu/phd/phd_announcements#huazhe

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Margaret P. Jaffey            margaret at cs.uchicago.edu
Department of Computer Science
Student Support Rep (Ry 156)               (773) 702-6011
The University of Chicago      http://www.cs.uchicago.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

