Article | Mathematics and Computer Science Division

Using machine learning approaches for load balancing of climate models

Iterative approach to load balancing through machine learning.

Complex climate simulation models must exploit today’s most advanced leadership-class compute platforms to provide meaningful predictions. Scheduling these simulations on hundreds of thousands of compute nodes so that all nodes are kept busy is a challenging task known as load balancing. For scientific codes such as the Community Earth System Model (CESM), load balancing is especially challenging, and finding the best balance for all situations is beyond the abilities of even the most experienced computer scientists. CESM is one of the most widely used climate models in the world. Of its six software components, five correspond to the main physical components of the climate system (land, ocean, atmosphere, sea ice, and land ice), and the sixth is a coupler. Each of the climate components exhibits a different scalability pattern and communicates through the coupler, which has its own scalability pattern. Researchers have devised various approaches in an effort to achieve optimal load-balancing parameter configurations for each component. For example, one popular approach is a heuristic method that gathers benchmarking data, calibrates a performance model using the data, and uses the model to decide on the optimal allocation. This model has several shortcomings, however, when it is used to load balance one of CESM’s key components: the sea ice component, CICE.
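The benchmark, calibrate, and allocate steps of such a heuristic can be illustrated with a minimal sketch. It assumes a performance model of the form T(n) = a/n + b, which the article does not specify; the benchmark numbers, function names, and target runtime are all hypothetical.

```python
def calibrate(benchmarks):
    """Least-squares fit of T(n) = a / n + b to (task_count, runtime) pairs."""
    xs = [1.0 / n for n, _ in benchmarks]
    ys = [t for _, t in benchmarks]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b


def smallest_allocation(a, b, candidates, target_runtime):
    """Return the smallest task count whose predicted runtime meets the target."""
    for n in sorted(candidates):
        if a / n + b <= target_runtime:
            return n
    return None  # no candidate is predicted to be fast enough


# Synthetic benchmark data (an assumption), generated from a = 1000, b = 1.
benchmarks = [(80, 13.5), (160, 7.25), (320, 4.125)]
a, b = calibrate(benchmarks)
tasks = smallest_allocation(a, b, [80, 160, 320, 640], target_runtime=5.0)
```

The model in this sketch depends only on the task count; it has no terms for how the grid is decomposed.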

“The problem with CICE component is that the optimal load balancing parameters are unknown in general case. At the same time poor load balancing in one component like CICE can result in inferior overall performance of CESM,” said Yuri Alexeev, an assistant computational scientist at Argonne. The shortcomings arise from the fact that load balancing in CICE occurs only where sea ice is located geographically. But because the processors are allocated across the entire Earth grid and several locations on the grid do not have sea ice, a poor fit results.

To address these shortcomings, a team of researchers from Argonne National Laboratory and the National Center for Atmospheric Research (NCAR) has devised a load-balancing algorithm based on machine learning. The algorithm involves two phases: a parallel initialization phase and a sequential iterative phase. In the initialization phase, the algorithm first considers a small subset and randomly samples a configuration for each task count in that subset. These configurations are evaluated in parallel to obtain their corresponding runtimes. The algorithm uses the resulting data points as a training set to build the initial predictive model. In the iterative phase, the algorithm uses the model to find high-quality parameter configurations with shorter predicted runtime for evaluation. The key here is reusing those evaluated configurations to improve the accuracy of the predictive model from the first phase.
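The two phases can be sketched as follows. This is a toy illustration under stated assumptions, not the team's implementation: `measure_runtime` is a synthetic stand-in for an actual timed run, the three-field configuration and its values are hypothetical, and a nearest-neighbor average stands in for the predictive model, whose type the article does not name.

```python
import random
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZES = [4, 8, 16, 32]  # hypothetical parameter values


def measure_runtime(config):
    """Synthetic stand-in for running the code and timing it (assumption)."""
    tasks, block, halo = config
    return 1000.0 / tasks + 0.5 * abs(block - 16) + (0.3 if halo else 0.0)


def all_configs(tasks):
    """Every configuration available at a given task count."""
    return [(tasks, b, h) for b in BLOCK_SIZES for h in (False, True)]


def predict(history, config, k=3):
    """Toy model: mean runtime of the k evaluated configs nearest to `config`."""
    def distance(other):
        return (abs(config[0] - other[0]) / 1024
                + abs(config[1] - other[1]) / 32
                + (config[2] != other[2]))
    nearest = sorted(history, key=lambda item: distance(item[0]))[:k]
    return sum(runtime for _, runtime in nearest) / len(nearest)


def search(target_tasks, subset, budget, seed=0):
    rng = random.Random(seed)
    # Phase 1 (parallel): one random configuration per task count in the subset.
    initial = [rng.choice(all_configs(t)) for t in subset]
    with ThreadPoolExecutor() as pool:
        runtimes = list(pool.map(measure_runtime, initial))
    history = list(zip(initial, runtimes))  # training set for the model
    # Phase 2 (sequential): evaluate the config with the shortest predicted
    # runtime, then add the measured result to the training set.
    for _ in range(budget):
        seen = {config for config, _ in history}
        candidates = [c for c in all_configs(target_tasks) if c not in seen]
        if not candidates:
            break
        choice = min(candidates, key=lambda c: predict(history, c))
        history.append((choice, measure_runtime(choice)))
    at_target = [item for item in history if item[0][0] == target_tasks]
    return min(at_target, key=lambda item: item[1]), history


best, history = search(target_tasks=256, subset=[80, 160, 256], budget=4)
```

Because each newly measured configuration is appended to `history`, every later prediction in the loop draws on more data than the one before it.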

One of the first questions the Argonne/NCAR team investigated was why the analytical model used for CESM fails to predict the runtime of the CICE component adequately. Using a machine-learning method as a diagnostic tool, the researchers analyzed the effect of several load-balancing parameters: three integer parameters that specify the number and size of the blocks, two configuration parameters that determine the decomposition strategy, and a binary parameter that specifies whether the code is to be run with or without synchronizing the array elements surrounding the local grid boundaries (called halos or ghost cells). The tests were run with task counts (corresponding to the number of Message Passing Interface tasks) of 80, 128, 160, 256, 320, 376, 512, 640, 800, and 1,024.
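A search space of this shape could be represented as in the sketch below. The field names and candidate values are hypothetical placeholders chosen for illustration, and splitting the three integer parameters into two block sizes and one block count is an assumption; only the task counts come from the article.

```python
from dataclasses import dataclass
from itertools import product

# Task counts reported in the article.
TASK_COUNTS = [80, 128, 160, 256, 320, 376, 512, 640, 800, 1024]


@dataclass(frozen=True)
class LoadBalanceConfig:
    block_size_x: int            # integer parameter (hypothetical name)
    block_size_y: int            # integer parameter (hypothetical name)
    max_blocks: int              # integer parameter (hypothetical name)
    decomposition_type: str      # decomposition strategy (hypothetical name)
    decomposition_setting: str   # decomposition strategy (hypothetical name)
    halo_flag: bool              # binary halo parameter (hypothetical name)


# Illustrative candidate values only; the real ranges are not given.
SPACE = {
    "block_size_x": (8, 16),
    "block_size_y": (8, 16),
    "max_blocks": (1, 2, 4),
    "decomposition_type": ("type_a", "type_b"),
    "decomposition_setting": ("setting_a", "setting_b", "setting_c"),
    "halo_flag": (False, True),
}

# Enumerate every combination of the candidate values.
configs = [LoadBalanceConfig(*values) for values in product(*SPACE.values())]
```

Even with these few placeholder values, the enumeration yields 144 configurations per task count, which hints at why exhaustive evaluation becomes costly.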

“To our surprise, the results showed that the impact of parameter values on the runtimes and the type of nonlinear interactions between them changes with an increase in the task counts,” said Prasanna Balaprakash, an assistant computer scientist at Argonne. “And the trend is not consistent: some parameters have less effect at large task counts, others have more effect, and some even show no impact.”

According to Sven Leyffer, a senior computational mathematician at Argonne, this is the first work on the use of machine learning approaches for analyzing the sensitivity of the load-balancing parameters. The previous model did not take this impact into account for the CICE component.

Another contribution of the Argonne/NCAR team was an evaluation of four variants of their machine-learning-based algorithm and random search, using expert-knowledge-based enumeration (EE) – the current practice – as the baseline. The objective was to find the optimal load-balancing parameter configuration with the shortest runtime. The results showed that, compared with EE and random search, the machine-learning-based load-balancing algorithm requires six times fewer evaluations to find the optimal configuration.

The researchers are confident that since the CICE allocation of processors affects the overall performance of the CESM, the new algorithm will improve the overall scaling of the CESM. Moreover, the algorithm does not require a priori knowledge of how to prune the search space – a feature that will be particularly useful when applying load-balancing strategies to other newly deployed components.

The tests were run on the IBM Blue Gene/P supercomputer at Argonne National Laboratory. The researchers emphasize, however, that the algorithm is general and not specific to either the CESM or the Blue Gene/P. By using the new algorithm, climatologists can run simulations more rapidly and efficiently on other high-performance architectures.

For further information, see the paper

“Machine-learning-based load balancing for community ice code component in CESM,” Prasanna Balaprakash, Yuri Alexeev, Sheri Mickelson, Sven Leyffer, Robert Jacob, Anthony Craig, in Proceedings of the 11th International Meeting on High-Performance Computing for Computational Science (VECPAR 2014).