Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems

16:30 - 17:00
Evaluating the Potential of Elastic Jobs in HPC Systems

David Eberius, Md. Wasi-ur-Rahman, David Ozog
Intel Corporation, USA

It is generally assumed that elastic parallel applications, which can dynamically resize their process count, would provide numerous benefits to High-Performance Computing (HPC) systems and applications. Supporting this capability, however, requires significant effort at several layers of the HPC software stack. At a minimum, the resource management system, the distributed communication libraries, and the distributed applications themselves would have to explicitly support elasticity. Because such widespread support is required, developers need significant motivation to commit to adding this capability. We aim to determine whether there are practical benefits to supporting elasticity by simulating HPC systems with support for elastic jobs using real-world job data. Our simulations show that adding elastic jobs yields significant improvements: up to 35.34% higher system utilization, 75.3% lower runtime, 99.76% lower wait time, and 75.22% lower total turnaround time.

14th IEEE International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, held in conjunction with SC23: The International Conference for High Performance Computing, Networking, Storage and Analysis