Improving Data Movement on Large HPC Systems

Identifying strategies and architectures (network and I/O) for improving communication on large-scale supercomputers, focusing on climate science and artificial intelligence workloads.

Many important scientific discoveries rely on supercomputers to solve data-intensive problems. These problems require moving large amounts of data for processing, storage, or visualization. AI and climate science are two such domains that experience significant data movement challenges for training deep learning models and generating climate projections, respectively. Data movement must be carefully optimized based on the requirements of the current workloads and on the capabilities of the network and storage infrastructure, such as transfer speeds. Optimizing data movement is challenging, however, because of the complexity of the workloads and infrastructure. Furthermore, most optimizations being applied to current workloads will not be ideal for future workloads due to changes in the workloads’ data requirements and the system capabilities.

This work aims to (1) provide a deeper understanding of data movement bottlenecks in current AI and climate science workloads on DOE systems through holistic performance analysis; (2) model and simulate future system capabilities that data-intensive workloads can leverage, including smarter routers and faster storage solutions; and (3) investigate new data movement optimizations for emerging workloads in order to best leverage the capabilities and configurations of future infrastructures. Through this work, a simpler performance analysis workflow will be developed to help scientists further accelerate data movement in their current applications. Additionally, this work will develop optimization techniques that should be employed in emerging workloads and recommend productive configurations of future infrastructures. The results of this work will help speed up the AI and climate science efforts of today and tomorrow.