GEOPM: A Vehicle for HPC Community Collaboration Toward Co-Designed Energy Management Solutions

Jonathan Eastep, Steve Sylvester, Christopher Cantalupo, Federico Ardanaz, Brad Geltz, Asma Al-Rawi, Fuat Keceli, Kelly Livingston

The performance of future large-scale HPC systems will be limited by the costs associated with scaling power. Some HPC centers are reaching the limits of their existing site power delivery infrastructure and face prohibitive upgrade costs. Others are reaching budgetary limits on their energy operating costs. Without a breakthrough in energy efficiency, the HPC industry may fail to maintain historical performance scaling rates and may fall short of 2018-2020 Exascale performance goals by an estimated factor of 2-3. Closing this gap will require co-designed hardware and software solutions for system energy management and increased collaboration between hardware vendors and the HPC software community. In this work, we introduce the Global Extensible Open Power Manager (GEOPM): a tree-hierarchical, plug-in extensible, open source runtime framework that we are contributing to the HPC community to accelerate collaboration and research toward co-designed energy management solutions. Initial results with an experimental power rebalancing optimization demonstrate runtime improvements of up to 32% on CORAL system procurement benchmarks such as miniFE and Nekbone on a power-limited Xeon Phi cluster. These promising results motivate further work with the community to extend GEOPM with new optimization strategies and achieve further speedups.