Abstract: Processing experimental data on Argonne Leadership Computing Facility (ALCF) supercomputers typically entails initiating transfers to ALCF with Globus or another protocol, logging in, writing a script to launch analysis jobs, and submitting the script to the system’s batch scheduler. In production scenarios, managing many jobs over an extended period requires significant human interaction. Balsam is a supercomputing workflow manager and edge service that automates this cycle and consolidates the workflow into a simple remote task submission command. Balsam users define their workflow, which might entail several data transfer and analysis stages, and submit instances as data is collected. The Balsam service continuously stages data transfers, bundles tasks, submits jobs for execution on ALCF systems, and stages the results back. This talk will provide a high-level overview of Balsam and walk through a scenario running on Theta, an 11.69-petaflops Cray system at ALCF. We will also highlight a few production science use cases of Balsam, which together have consumed hundreds of millions of core-hours on ALCF systems to date.
XSD/SDM Special Presentation