Mathematics and Computer Science Division

Argo: An exascale operating system

New exascale operating system and runtime system designed to support extreme-scale scientific computation
Architecture overview of Argo

Argo is a new exascale operating system and runtime system designed to support extreme-scale scientific computation. It is built on an agile, new modular architecture that supports both global optimization and local control. It aims to efficiently leverage new chip and interconnect technologies while addressing the new modalities, programming environments, and workflows expected at exascale. It is designed from the ground up to run future high-performance computing applications at extreme scales.

At the heart of the project are four key innovations: dynamic reconfiguration of node resources in response to workload, support for massive concurrency, a hierarchical framework for power and fault management, and a “beacon” mechanism that allows resource managers and optimizers to communicate and control the platform. These innovations will result in an open-source prototype system that runs on several architectures. It is expected to form the basis of production exascale systems deployed in the 2018–2020 timeframe.

The design is based on a hierarchical approach. A global view enables Argo to control resources such as power or interconnect bandwidth across the entire system, respond to system faults, or tune application performance. A local view is essential for scalability, enabling compute nodes to manage and optimize massive intranode thread and task parallelism and adapt to new memory technologies. In addition, Argo introduces the idea of “enclaves”: sets of resources dedicated to a particular service and capable of introspection and autonomic response. Enclaves will be able to change the system configuration of nodes, change the allocation of power to different nodes, or migrate data or computations from one node to another. The enclaves will be used to demonstrate support for different levels of fault tolerance – a key concern of exascale systems – with some enclaves handling node failures by means of global restart and other enclaves supporting finer-level recovery.
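
To make the hierarchy concrete, the following is a minimal illustrative sketch in Python of how a system-wide power budget might flow from the global view down through enclaves to individual nodes. It is not Argo code: the `Node` and `Enclave` classes, the `allocate` function, and the proportional-share policy are all hypothetical, chosen only to show global optimization and local control working at different levels.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical compute node with a requested and an assigned power level."""
    name: str
    demand_watts: float      # power the node would like to draw
    cap_watts: float = 0.0   # power cap assigned by its enclave


@dataclass
class Enclave:
    """Hypothetical enclave: a set of nodes managed as one unit."""
    name: str
    nodes: list = field(default_factory=list)
    budget_watts: float = 0.0

    def demand(self) -> float:
        return sum(n.demand_watts for n in self.nodes)

    def distribute(self) -> None:
        # Local control: split this enclave's budget across its own nodes
        # in proportion to demand, never granting more than a node asked for.
        total = self.demand()
        for n in self.nodes:
            share = n.demand_watts / total if total > 0 else 0.0
            n.cap_watts = min(n.demand_watts, self.budget_watts * share)


def allocate(system_budget_watts: float, enclaves: list) -> None:
    # Global view: split the system budget across enclaves in proportion
    # to their aggregate demand, then let each enclave decide locally.
    total = sum(e.demand() for e in enclaves)
    for e in enclaves:
        share = e.demand() / total if total > 0 else 0.0
        e.budget_watts = min(e.demand(), system_budget_watts * share)
        e.distribute()


if __name__ == "__main__":
    compute = Enclave("compute", [Node("n0", 300.0), Node("n1", 100.0)])
    storage = Enclave("storage", [Node("s0", 100.0)])
    allocate(250.0, [compute, storage])
    for e in (compute, storage):
        for n in e.nodes:
            print(e.name, n.name, n.cap_watts)
```

In this sketch the global level sees only each enclave's aggregate demand, while the per-node decision stays inside the enclave. That separation is the property the hierarchical approach relies on for scalability; a real system would replace the proportional policy with whatever each enclave's service requires.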