Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions
Efficient load balancing methods are required to obtain scalability in many scientific software applications. One such application is NWChem's coupled-cluster module, which allows for detailed study of chemical problems by iteratively solving the Schrödinger equation with an accurate ansatz. For this module, relevant task information can be obtained just before execution at negligible cost, which suggests that a static mapping of task groups to processors can be a simpler and more efficient alternative to centralized dynamic load balancing.
The distributed tensor contractions are block sparse, and an a priori inspection can quickly assign cost estimates to tasks based on characteristics such as their dimensions. Architecture-specific, empirically driven performance models of the dominant SORT and DGEMM routines serve as the cost estimator for a once-per-simulation static partitioning process. This inspector/executor technique has been demonstrated to improve the execution time of NWChem's coupled-cluster module by as much as 50% at scale. The technique is applicable to any scientific application requiring load balance where performance models or estimates of kernel execution times are available.
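The inspector/executor idea can be illustrated with a minimal Python sketch. It is not NWChem's implementation: the names `Task`, `estimate_cost`, `inspect_tasks`, and `partition` are hypothetical, the cost coefficients are placeholders rather than measured values, and the greedy largest-cost-first heuristic is an assumed stand-in for whatever partitioner is actually used. The inspector skips zero blocks and attaches a modeled SORT plus DGEMM cost to each remaining task; the partitioner then computes a static assignment once, before execution.

```python
import heapq
from typing import Callable, Iterable, List, NamedTuple, Tuple


class Task(NamedTuple):
    """One block-level contraction, described by its tile dimensions (hypothetical)."""
    m: int
    n: int
    k: int


def estimate_cost(task: Task, dgemm_coeff: float = 1.0e-9,
                  sort_coeff: float = 5.0e-9) -> float:
    """Toy performance model; both coefficients are placeholders.

    A real model would be fitted empirically per architecture.
    """
    dgemm = dgemm_coeff * 2 * task.m * task.n * task.k  # flop count of the multiply
    # SORT (index permutation) is assumed to scale with the elements moved.
    sort = sort_coeff * (task.m * task.k + task.k * task.n + task.m * task.n)
    return dgemm + sort


def inspect_tasks(candidates: Iterable[Task],
                  is_nonzero: Callable[[Task], bool]) -> List[Tuple[Task, float]]:
    """Inspector: drop blocks that vanish by sparsity and cost the rest."""
    return [(t, estimate_cost(t)) for t in candidates if is_nonzero(t)]


def partition(costed_tasks, num_procs: int):
    """Static partitioning: give the next-largest task to the least-loaded processor."""
    heap = [(0.0, p) for p in range(num_procs)]  # (current load, processor id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_procs)]
    for task, cost in sorted(costed_tasks, key=lambda tc: tc[1], reverse=True):
        load, p = heapq.heappop(heap)
        assignment[p].append(task)
        heapq.heappush(heap, (load + cost, p))
    loads = [0.0] * num_procs
    for load, p in heap:
        loads[p] = load
    return assignment, loads
```

In the executor phase, each processor would simply iterate over its own entry of `assignment`, with no central task counter to contend for. The quality of the balance depends entirely on how well the cost model predicts real kernel times.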