Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems

15:30 - 16:00
Risk-Aware Scheduling Algorithms for Variable Capacity Resources

Lucas Perotin, Anne Benoit, Yves Robert
Laboratoire LIP, ENS Lyon, Lyon, France

Chaojie Zhang
Microsoft Research, Seattle, WA

Rajini Wijayawardana, Andrew A. Chien
University of Chicago, Chicago, IL

The drive to decarbonize the power grid to slow the pace of climate change has caused dramatic variation in the cost, availability, and carbon intensity of power. This variation has begun to shape the planning and operation of datacenters. This paper focuses on the design of scheduling algorithms for independent jobs that are submitted to a platform whose resource capacity varies over time. Jobs are submitted online and assigned to a target machine by the scheduler, which is agnostic to the rate and amount of resource variation. The optimization objective is the goodput, defined as the fraction of time devoted to effective computations (re-execution does not count). We introduce several novel algorithms that: (i) decide what fraction of the resources can be used safely; (ii) maintain a risk index associated with each machine; and (iii) achieve a global load balance while mapping longer jobs to safer machines. We assess the performance of these algorithms using one set of actual workflow traces together with three sets of synthetic traces. The goodput achieved by our algorithms is up to 10% higher than that of standard first-fit approaches, and we never experience any loss in complementary metrics such as the maximum or average stretch.
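As an illustration of the goodput metric defined in the abstract, the following is a minimal sketch and not the authors' implementation. The `Attempt` record, the `goodput` function, and the choice of total available machine-time as the denominator are all assumptions made for this example; the abstract only states that re-executed work does not count as effective computation.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One execution attempt of a job (hypothetical record for this sketch)."""
    job_id: str
    machine_time: float  # machine-time units consumed by this attempt
    completed: bool      # False if the attempt was killed by a capacity drop


def goodput(attempts, total_machine_time):
    """Fraction of machine-time spent on attempts that ran to completion.

    Assumption: the denominator is the total machine-time available over
    the observation window. Interrupted attempts consume time but add
    nothing to the numerator, so re-execution is not counted as useful.
    """
    if total_machine_time <= 0:
        raise ValueError("total_machine_time must be positive")
    useful = sum(a.machine_time for a in attempts if a.completed)
    return useful / total_machine_time


# Job B is interrupted once and then re-executed from scratch.
attempts = [
    Attempt("A", 4.0, True),
    Attempt("B", 3.0, False),  # lost work: does not count
    Attempt("B", 6.0, True),
    Attempt("C", 2.0, True),
]
print(goodput(attempts, total_machine_time=20.0))
```

In this toy trace, 12 of the 20 available machine-time units go to attempts that complete, and the 3 units spent on the interrupted attempt of job B are wasted. Mapping longer jobs to safer machines, as the abstract describes, aims to reduce exactly this kind of lost work.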

14th IEEE International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, held in conjunction with SC23: The International Conference for High Performance Computing, Networking, Storage and Analysis