Abstract: Building efficient and scalable system software for large-scale systems, especially for performance analysis and monitoring, is increasingly important both for developers of parallel applications and for designers of next-generation high-performance computing (HPC) systems. However, conventional performance tools suffer from significant time and space overhead as problem sizes and system scales keep growing. In contrast, the cost of source code analysis is independent of problem size and system scale, which makes it very appealing for large-scale performance analysis. Inspired by this observation, we have designed a series of lightweight system software tools for HPC systems, including a memory access monitoring tool, a performance variance detection tool, and a communication trace compression tool. In this talk, I will share our experience in building these tools by combining static analysis with runtime analysis, and point out the main challenges in this direction.
Bio: Jidong Zhai is a tenured associate professor in the computer science department of Tsinghua University. His research interests include high-performance computing, performance evaluation, compilers, and heterogeneous computing.