Improving MPI Reduction Performance for Manycore Architectures with OpenMP and Data Compression

Hongzhang Shan, Samuel Williams and Calvin W. Johnson

MPI reductions are widely used in scientific applications and often become the bottleneck to scaling. For reductions on vectors, different algorithms have been developed to balance messaging overhead against bandwidth. However, most implementations have ignored the fact that single-thread performance has not scaled as fast as aggregate network bandwidth. In this work, we propose, implement, and evaluate two approaches (threading and exploitation of sparsity) to accelerate MPI reductions on large vectors when running on manycore-based supercomputers. Our benchmark results show that the new techniques improve MPI Reduce performance by up to 4x and improve BIGSTICK application performance by up to 2.6x.
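As a rough illustration of the two ideas named above, the following C sketch shows (a) an OpenMP-threaded element-wise combine, the local arithmetic step a sum reduction performs on each received vector, and (b) a simple index/value packing of nonzero entries that could shrink the data exchanged when vectors are sparse. This is a minimal sketch under stated assumptions, not the implementation evaluated in this work: the function names combine_sum, pack_nonzeros, and combine_sparse are hypothetical, the reduction operator is assumed to be a sum over doubles, and the index/value layout is only one possible way to exploit sparsity. The MPI communication itself is omitted.

```c
#include <stddef.h>

/* Hypothetical helper: element-wise sum of `in` into `inout`.
 * The OpenMP pragma splits the loop across threads; if the compiler is
 * invoked without OpenMP support, the pragma is ignored and the loop
 * runs serially with the same result. */
void combine_sum(double *inout, const double *in, size_t n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) {
        inout[i] += in[i];
    }
}

/* Hypothetical helper: pack the nonzero entries of `v` as (index, value)
 * pairs. `idx` and `val` must each have room for n entries.
 * Returns the number of nonzeros written. */
size_t pack_nonzeros(const double *v, size_t n, size_t *idx, double *val)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] != 0.0) {
            idx[k] = i;
            val[k] = v[i];
            k++;
        }
    }
    return k;
}

/* Hypothetical helper: add k packed (index, value) pairs into the dense
 * vector `inout`. Only the nonzero positions are touched. */
void combine_sparse(double *inout, const size_t *idx, const double *val,
                    size_t k)
{
    for (size_t j = 0; j < k; j++) {
        inout[idx[j]] += val[j];
    }
}
```

In a sketch like this, packing pays off only when the number of nonzeros is small enough that the index/value pairs take less space than the dense vector; a real implementation would presumably fall back to the dense path otherwise.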