La VALSE: it’s not the waltz your grandmother may have enjoyed, composed by Ravel in the early 1900s. In fact, unlike Ravel’s musical composition of the same name, La VALSE is an innovative approach to scalable visualization. It is intended for identifying the sources of failure events in supercomputers.
Such failures arise from a variety of causes, including hardware, system software, file systems, and power, and they are a significant problem in today’s high-performance computing because they cause jobs to terminate unexpectedly during execution. One of the most important ways to study such failures is to analyze the logs generated by different components in the supercomputers.
Researchers from the Mathematics and Computer Science (MCS) division at Argonne National Laboratory have designed La VALSE to do just that.
The La VALSE framework comprises multiple linked views to visualize so-called RAS (for reliability, availability, and serviceability) logs. For the timeline view, the researchers created a scalable version of the popular ThemeRiver tool so that it can highlight individual messages and increase the dynamic range of the input volumes; the timeline view also features arc diagrams, enabling interactive exploration of tens of millions of RAS logs. The spatial view visualizes the occurrences of RAS messages on hundreds of thousands of elements of the supercomputer — including node boards and racks. And the multidimensional view enables interactive filtering of different categories such as severity; for this view the researchers developed a scalable online data cube engine that can query 55 million RAS logs in less than one second. Not only can the query engine run on a single machine, but it also is scalable to distributed and parallel environments.
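The idea behind a data cube can be illustrated with a minimal Python sketch. It is not the researchers' engine: the field names (severity, component, hour) and the sample records are hypothetical, and the sketch shows only the general technique of pre-aggregating counts per combination of category values so that a filter scans the aggregated cells rather than every log message.

```python
from collections import Counter

# Hypothetical dimensions of a RAS record; the real logs have more fields.
DIMENSIONS = ("severity", "component", "hour")


def build_cube(records):
    """Count records per full combination of dimension values."""
    cube = Counter()
    for rec in records:
        cube[tuple(rec[d] for d in DIMENSIONS)] += 1
    return cube


def query(cube, group_by, **filters):
    """Histogram over `group_by`, restricted to cells matching the filters.

    Each filter maps a dimension name to the values that are allowed.
    """
    group_index = DIMENSIONS.index(group_by)
    allowed = {DIMENSIONS.index(k): set(v) for k, v in filters.items()}
    hist = Counter()
    for cell, count in cube.items():
        if all(cell[i] in values for i, values in allowed.items()):
            hist[cell[group_index]] += count
    return hist


# Made-up sample data for illustration only.
records = [
    {"severity": "FATAL", "component": "MC", "hour": 0},
    {"severity": "WARN", "component": "MC", "hour": 0},
    {"severity": "FATAL", "component": "CNK", "hour": 1},
    {"severity": "INFO", "component": "CNK", "hour": 1},
    {"severity": "FATAL", "component": "MC", "hour": 1},
]

cube = build_cube(records)
fatal_by_hour = query(cube, "hour", severity=["FATAL"])
```

In this sketch the cost of a query depends on the number of distinct cells rather than the number of messages, which is one reason pre-aggregation suits interactive filtering; how the actual engine stores and distributes its cube is described in the full report.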
“La VALSE makes it easy to trace causes of failure events and to correlate errors that occur in different categories, times, and locations,” said Hanqi Guo, an assistant computer scientist in the MCS division.
In developing La VALSE, the researchers faced several challenges. RAS logs are noisy, so key elements can be obscured in traditional visualizations. Moreover, the logs are heterogeneous, with distinct data structures.
The sheer number of RAS messages generated by the supercomputer is also a challenge. Mira, the IBM Blue Gene/Q supercomputer at Argonne, has generated 55 million RAS messages in the past five years. “Traditional visualizations simply cannot render so many messages at interactive speed,” said Tom Peterka, a computer scientist in the MCS Division. “To the best of our knowledge, La VALSE is the first visualization framework that addresses these challenges.”
The Argonne researchers have successfully used La VALSE in several case studies on Mira. In one case, a user explored a burst of network errors; by zooming in on La VALSE’s timeline view, the user was able to identify the root cause. In another case, using La VALSE’s spatial view, a user discovered spatial correlations among RAS messages at different scales and noted how messages propagate from one spot through the network. The information in both instances helps system administrators narrow the range of their error diagnoses and even identify the sources of failure events.
The researchers emphasize that although the current implementation of La VALSE is tailored for the Mira supercomputer, the scalable design of La VALSE can be extended to analyze logs in other systems. For IBM Blue Gene systems, the extension is straightforward since the system hierarchy is similar and RAS logs follow almost the same protocol. For other systems, further work may be needed. To help in these efforts, the researchers plan to generalize La VALSE to visualize logs for other supercomputers and clusters, such as Cray systems and Linux clusters.
And just as numerous versions of the waltz form developed in the 20th century, so too do the Argonne researchers plan to expand the design of La VALSE. “We want to add I/O, performance, and communication logs to support a comprehensive analysis over the whole machine,” said Guo.
For the full report of this research, see the MCS Division website (http://www.mcs.anl.gov/~hguo/publications/GuoDGPC18.pdf):
H. Guo, S. Di, R. Gupta, T. Peterka, and F. Cappello, “La VALSE: Scalable Log Visualization for Fault Characterization in Supercomputers,” in H. Childs and F. Cucchietti (Eds.), Proceedings of the Eurographics Symposium on Parallel Graphics and Visualization, 2018, doi:10.2312/pgv.20181099.
For the link to the open source software, see https://github.com/hguo/LaValse.