A fundamental principle of parallel computing is that subdividing a computation across P processors can, ideally, yield a P-fold reduction in time to solution. For example, a simulation that uses a billion particles or gridpoints can be distributed across two compute nodes and run in half the time – for essentially the same energy – compared with running on a single node.
“Since clock rates are no longer increasing, this multiplicative effect of parallel computing is the only mechanism we have for increasing the speed of calculations by factors of thousands,” said Misun Min, a computational scientist in Argonne’s Mathematics and Computer Science Division.
Motivated by the importance of this situation as computers approach the exascale era, Min and her colleagues at the University of Illinois at Urbana-Champaign and Washington University explored two critical questions: How far can a given problem scale, and what needs to be done to ensure continued performance gains on high-performance computing (HPC) platforms?
The researchers noted that even though most HPC centers encourage submission of large jobs, most applications do not use the entire machine – usually, the explanation goes, because performance efficiency drops off.
“Our objective, then, was to quantify the causes of this performance drop-off and to identify potential mitigation strategies,” said Min.
To this end, the team analyzed the scaling performance of two Argonne-developed production codes, Nek5000 and NekCEM, on multi-CPU configurations such as the Cray XK7 and Blue Gene/Q. Focusing on Poisson’s equation – a partial differential equation widely found in theoretical and applied physics and engineering problems – they considered three iterative solution strategies: Jacobi iteration, conjugate gradient iteration, and geometric multigrid. The results identify and quantify several bottlenecks that limit strong scaling for these classes of solvers.
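To make the simplest of these three strategies concrete, the sketch below applies Jacobi iteration to a one-dimensional Poisson problem with a finite-difference discretization. This is only an illustrative toy – the production codes Nek5000 and NekCEM use three-dimensional spectral-element discretizations – and the function name and iteration count are choices made here, not taken from the paper.

```python
import numpy as np

def jacobi_poisson_1d(f, iters=20000):
    """Solve -u'' = f on (0, 1) with u(0) = u(1) = 0 on a uniform grid
    of n interior points by Jacobi iteration.

    A minimal 1-D sketch of the simplest solver class discussed in the
    paper; not the production algorithm.
    """
    n = len(f)
    h = 1.0 / (n + 1)
    u = np.zeros(n)
    for _ in range(iters):
        # Pad with the zero Dirichlet boundary values, then update each
        # point from its neighbors: u_i <- (u_{i-1} + u_{i+1} + h^2 f_i) / 2
        up = np.concatenate(([0.0], u, [0.0]))
        u = 0.5 * (up[:-2] + up[2:] + h * h * f)
    return u
```

Jacobi needs no global communication per update, but its convergence slows as the grid is refined, which is why the paper also considers conjugate gradients (faster, but requiring global inner products) and geometric multigrid (optimal iteration counts, but with work on coarse levels that parallelizes poorly).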
The researchers note that several factors can dramatically affect a method’s scalability. For example, a discontinuous Galerkin formulation (in which the solution is continuous within each element but discontinuous across elements) has a distinct advantage in that each element communicates with at most 6 neighbors, compared with 26 or more for a continuous Galerkin formulation (in which the discretization is based on function values at the nodes and is continuous across elements).
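The 6-versus-26 neighbor counts follow directly from the geometry of a structured 3-D mesh, as this small enumeration shows (the function name and string arguments are illustrative choices, not from the paper):

```python
from itertools import product

def neighbor_count(connectivity):
    """Count the neighbors of an interior element in a structured 3-D mesh.

    'face' counts only face-adjacent elements (the coupling a
    discontinuous Galerkin element needs); 'full' also counts edge- and
    corner-adjacent elements (the coupling of a continuous formulation).
    """
    # All offsets in {-1, 0, 1}^3 except staying in place.
    offsets = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]
    if connectivity == "face":
        # Face neighbors differ from the element in exactly one coordinate.
        return sum(1 for d in offsets if sum(abs(c) for c in d) == 1)
    return len(offsets)  # faces + edges + corners: 3^3 - 1

print(neighbor_count("face"))  # → 6
print(neighbor_count("full"))  # → 26
```

Fewer neighbors means fewer messages per element per iteration, which matters most at the strong-scaling limit, where each processor holds only a handful of elements and communication dominates.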
The researchers also investigated the scalability potential of graphics processing units (GPUs). These are becoming increasingly popular in HPC systems and are candidate node architectures for extreme-scale platforms as well. GPUs consist of hundreds to thousands of cores designed to handle multiple tasks independently. A key issue for multi-GPU runs, then, is how much work each node must be given to realize its top performance. The results of the team’s experiments using the electromagnetics code NekCEM show strong scalability for CPU-based simulations down to the limit of one spectral element per core, whereas multi-GPU simulations hit a strong-scaling limit arising from the fall-off of single-node performance.
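A toy cost model illustrates why strong scaling stalls when per-node work shrinks: the useful work per node decreases with the node count, but fixed per-iteration costs (communication latency, GPU kernel launch) do not. The parameter values below are illustrative assumptions, not measurements from the paper.

```python
def strong_scaling_efficiency(n_total, nodes, t_point=1e-9, t_overhead=1e-4):
    """Parallel efficiency under a toy cost model.

    Each node spends t_point seconds per gridpoint of useful work plus a
    fixed overhead t_overhead per solver iteration. All parameter values
    here are hypothetical, for illustration only.
    """
    t1 = n_total * t_point + t_overhead             # time on one node
    tp = (n_total / nodes) * t_point + t_overhead   # perfect work split
    return t1 / (nodes * tp)                        # speedup / nodes

# Efficiency falls as each node's share of the problem shrinks toward
# the fixed overhead, long before one gridpoint per node is reached.
for p in (1, 10, 100, 1000):
    print(p, strong_scaling_efficiency(10**8, p))
```

Under this model, efficiency stays near 1 only while per-node work dwarfs the fixed overhead – which is exactly why the paper’s remedy is to raise performance at small per-node problem sizes.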
What can we conclude about algorithm and architecture performance characteristics and their impact on scalability? Paul Fischer, a professor at UIUC, answered this question by stating: “One startling calculation showed that scalable solution strategies for a computational fluid dynamics simulation at exascale would require about 17 trillion gridpoints to make effective use of the entire machine. Based on this and similar results, we recommend that future development work focus on high performance for reduced problem sizes on each node. Such an approach will be essential for strong scaling and reduced turnaround time.”
This work appeared in an invited paper titled “Scaling Limits for PDE-Based Simulation,” by P. F. Fischer, K. Heisey, and M. Min, presented at the 22nd American Institute of Aeronautics and Astronautics Computational Fluid Dynamics Conference.