Argonne National Laboratory

Using Massively Parallel Simulation for Extreme-Scale Network Co-Design

Misbah Mubarak, Postdoc Interviewee
May 14, 2014 10:30AM to 11:30AM
Building 240, Room 4301
A key factor in the effectiveness of a massively parallel system is its interconnect network. With future exascale systems potentially reaching 100,000 to 1 million compute nodes, considerable research is under way to determine a network topology that maximizes network bandwidth under various traffic patterns. In this work we present a methodology for high-fidelity modeling and simulation of dragonfly (used by the Cray XC30 system) and torus (used by the IBM Blue Gene series, Cray XT, and Cray XE systems) network topologies at exascale size using the Rensselaer Optimistic Simulation System (ROSS).

This work evaluates various configurations of a million-node torus network to examine the effect of torus dimensionality on network performance under relevant HPC traffic patterns. We also explore a million-node dragonfly network model and investigate its different configurations and routing algorithms. We then evaluate the performance of the simulation itself to demonstrate that large-scale network simulations can be executed efficiently on today's leadership-class supercomputers. The accuracy of the torus and dragonfly network models is validated against empirical measurements from Blue Gene supercomputers and against simulated results from the cycle-accurate simulator BookSim, respectively.
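As a rough illustration of why torus dimensionality matters, the sketch below compares the diameter and mean hop count of several tori that each hold 2^20 (about one million) nodes. The dimension sizes are hypothetical choices made for this example, not the configurations evaluated in the talk, and the function names `ring_mean_distance` and `torus_stats` are likewise illustrative. The calculation assumes minimal routing and uniformly random destinations, and it ignores contention, link bandwidth, and everything else the ROSS-based simulation models.

```python
from math import prod


def ring_mean_distance(k):
    """Mean shortest-path distance on a k-node ring from a fixed node
    to a uniformly chosen node (the node itself included)."""
    return sum(min(d, k - d) for d in range(k)) / k


def torus_stats(dims):
    """Return (node count, diameter, mean hops) for a torus whose
    dimension sizes are given in dims, assuming minimal routing."""
    nodes = prod(dims)
    # The farthest node is half-way around the ring in every dimension.
    diameter = sum(k // 2 for k in dims)
    # Dimensions are independent, so per-dimension means add up.
    mean_hops = sum(ring_mean_distance(k) for k in dims)
    return nodes, diameter, mean_hops


# Hypothetical configurations (assumptions for illustration only);
# each one contains exactly 2**20 nodes.
configs = {
    "2-D": (1024, 1024),
    "3-D": (128, 128, 64),
    "5-D": (16,) * 5,
    "10-D": (4,) * 10,
}

for name, dims in configs.items():
    nodes, diameter, mean_hops = torus_stats(dims)
    print(f"{name:>5}: nodes={nodes} diameter={diameter} mean_hops={mean_hops:.1f}")
```

Under these assumptions, raising the dimensionality at a fixed node count shortens paths considerably, at the cost of more links per node. That trade-off is one reason a detailed simulation is needed to compare configurations under realistic traffic.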

The dragonfly and torus network models form part of the interconnect component of the 'enabling CO-Design of multi-layer Exascale Storage architectures (CODES)' simulation toolkit, so the CODES I/O and storage models can use these high-fidelity networks as their underlying interconnect backbone.

Additionally, to evaluate the communication behavior of large-scale scientific applications on exascale interconnects, work is in progress on a network workload component for CODES that uses communication patterns from scientific applications to drive the CODES dragonfly and torus network simulations.