Analysis and Architecture Aware Optimizations on the K Computer
High performance computing systems are increasingly complex. The K Computer, ranked 4th in the latest TOP500 list, is a typical example of such complexity. Each compute node contains eight cores sharing multiple levels of a hierarchical memory (along with additional hardware features), while the Tofu network linking those nodes has a six-dimensional topology. To make the most of such systems, new performance analysis methods and optimization techniques that take advantage of specific architecture features may be required.
In this talk we will present examples of performance optimizations at two levels of the K Computer: inside the compute node, with memory optimizations using the hardware cache partitioning facility provided by the SPARC64 VIIIfx processor, and between compute nodes, with a load balancing method based on work stealing across the Tofu topology. We will support these optimizations with new performance analysis methods for cache locality and load balancer latency. Finally, we will argue that while these optimizations are fundamentally architecture-specific, they can be easily ported to other systems.
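To give a flavor of topology-aware work stealing, the sketch below orders candidate steal victims by their wrap-around Manhattan distance on a multi-dimensional torus, so that steals prefer nearby nodes and keep traffic local on the interconnect. This is an illustrative sketch under assumed conventions (nodes identified by coordinate tuples, torus extents in `dims`), not the implementation presented in the talk.

```python
def torus_distance(a, b, dims):
    """Manhattan distance on a multi-dimensional torus.

    Each axis wraps around, so take the shorter way around
    per dimension. a, b are coordinate tuples; dims gives
    the extent of each dimension (hypothetical encoding).
    """
    return sum(min(abs(x - y), d - abs(x - y))
               for x, y, d in zip(a, b, dims))

def victims_by_distance(me, nodes, dims):
    """Order candidate steal victims nearest-first.

    A work-stealing scheduler can probe this list in order,
    preferring topologically close nodes before distant ones.
    """
    return sorted((n for n in nodes if n != me),
                  key=lambda n: torus_distance(me, n, dims))
```

On a six-dimensional torus such as Tofu's, `dims` would simply be a six-element tuple; the same distance metric applies unchanged.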