Innovations in science and technology seem to be happening every day, with an increasing share of them resulting from scientific simulations on large-scale computers. As these systems progress toward the exascale, however, the probability of fatal events increases.
“Fatal events” is another term for system failures. Understanding these fatal events is critical, not only to scientific users, but also to systems managers and vendors. Yet the root causes of many of these events are hard to identify. Furthermore, developing an effective program to analyze fatal events on a large scale is difficult because of the sheer number of such events — as many as half a million a year.
One reason for the high incidence of fatal events is that a single event (for example, a coolant system problem) can affect many modules, leading to a large number of fatal messages. Another reason is that system administrators may generate fatal messages in order to complete maintenance checks.
Most studies addressing the problem have focused on fatal events in small to medium-sized computer systems, on short time periods, or on errors tied to specific user applications. Researchers at Argonne National Laboratory have developed a system log analysis tool — called LogAider — to explore fatal events on the IBM Blue Gene/Q “Mira” computer system, one of the world’s largest supercomputers.
“We started with 60 million messages from the RAS (reliability, availability, serviceability) logs for the past five years on Mira,” said Sheng Di, a computer scientist in Argonne’s Mathematics and Computer Science (MCS) Division. “The number was overwhelming, but we gradually worked it down.”
First, the researchers removed events identified by RAS as nonfatal. That left 2.6 million logs — still far too many to analyze effectively. Recognizing that many of these were duplicates, the researchers then devised a three-stage filtering system in LogAider to remove the duplicated messages. The result: only 1,255 fatal events remained of the original 60 million. Next, the researchers set to work using LogAider to analyze this remaining group.
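The article does not spell out LogAider’s three filtering stages, but the core idea — collapsing bursts of messages triggered by one root cause into a single event — can be sketched with a simple time-window deduplication pass. All records and message IDs below are hypothetical:

```python
from datetime import datetime, timedelta

# Hypothetical RAS log records: (timestamp, message_id, location)
records = [
    ("2017-03-01 10:00:00", "00090200", "R00-M0"),
    ("2017-03-01 10:00:02", "00090200", "R00-M1"),  # same root cause, other midplane
    ("2017-03-01 10:00:03", "00090200", "R00-M0"),  # exact repeat
    ("2017-03-02 09:00:00", "00090200", "R00-M0"),  # a day later: a new event
]

def dedup(records, window_seconds=300):
    """Collapse messages with the same ID arriving within a short
    time window into a single event (one simple dedup stage)."""
    events = []
    last_seen = {}  # message_id -> datetime of last message seen
    for ts_str, msg_id, loc in records:
        ts = datetime.strptime(ts_str, "%Y-%m-%d %H:%M:%S")
        prev = last_seen.get(msg_id)
        if prev is None or ts - prev > timedelta(seconds=window_seconds):
            events.append((ts_str, msg_id, loc))
        last_seen[msg_id] = ts
    return events

print(len(dedup(records)))  # 4 raw messages collapse to 2 events
```

A real pipeline would also need stages that correlate across message IDs and locations; this shows only the time-window step.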
“We had three main questions,” Di said. “Which components were most error prone? What is the temporal correlation between fatal events? And what is the spatial distribution of the events?”
LogAider proved to be a reliable and efficient tool in helping answer these questions. With respect to components, the researchers found that 80 percent of fatal events were related to only a few components: the hardware monitor, the network controller, and the machine controller running on the service node. This proportion is typical of the so-called 80-20 rule, also shown in studies on other architectures. In contrast to those studies, however, the results on Mira showed the fatal events striking mainly at the firmware rather than the CPU or the network.
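A concentration like the 80-20 rule can be checked directly from per-component event counts. The counts below are made-up stand-ins, not the study’s actual tallies:

```python
def pareto_share(counts, top_n=3):
    """Fraction of all fatal events accounted for by the top_n components."""
    ordered = sorted(counts.values(), reverse=True)
    return sum(ordered[:top_n]) / sum(ordered)

# Hypothetical per-component fatal-event counts
counts = {
    "hw_monitor": 520,
    "network_controller": 310,
    "machine_controller": 180,
    "compute_node": 120,
    "io_node": 80,
    "other": 45,
}

print(round(pareto_share(counts), 2))  # top 3 components dominate
```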
Analysis of temporal correlations focused on message IDs — the most important field in Mira, determining many other field values. Here the results indicated that fatal message IDs tend to cluster more than other message IDs do, and they may have strong mutual correlations, particularly with warn message ID events. Moreover, the researchers identified seasonal variability in the mean time between fatal events on Mira — a new insight made possible by the long-term nature of the study.
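One way seasonal variability like this can surface is by computing the mean time between fatal events per calendar month. The sketch below uses invented timestamps purely for illustration:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical fatal-event timestamps (post-deduplication)
fatal_times = [
    "2017-01-03 02:10", "2017-01-10 14:00", "2017-01-25 08:30",
    "2017-07-02 11:00", "2017-07-04 09:15", "2017-07-05 23:40",
]

def mtbf_by_month(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Mean time between fatal events (in hours), grouped by month;
    gaps spanning a month boundary are ignored."""
    parsed = sorted(datetime.strptime(t, fmt) for t in timestamps)
    gaps = defaultdict(list)
    for a, b in zip(parsed, parsed[1:]):
        if a.strftime("%Y-%m") == b.strftime("%Y-%m"):
            gaps[a.strftime("%Y-%m")].append((b - a).total_seconds() / 3600.0)
    return {month: sum(g) / len(g) for month, g in gaps.items()}

print(mtbf_by_month(fatal_times))
```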
The initial spatial correlation analysis showed no clear locality correlation across racks of Mira over long periods. An analysis at finer granularity, however, showed a relatively strong correlation between fatal events inside racks.
“These results, some quite startling, motivated us to develop a list of takeaways that we believe can help system administrators, vendors, and fault tolerance researchers better understand fatal events in extreme-scale systems,” said Hanqi Guo, an MCS assistant computer scientist. For example, one takeaway points out that while the popular Weibull probability distribution gives the best fit for the distribution of fatal events, it may not be the best choice for theoretical fault tolerance analysis because of its complexity. Two other options are suggested as alternatives.
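To see why a Weibull fit is attractive yet analytically heavier, compare its survival function (the probability that the next fatal event is more than t hours away) with that of the simpler, memoryless exponential model. The parameters below are illustrative, not fitted values from Mira:

```python
import math

def weibull_survival(t, shape, scale):
    """P(time to next fatal event > t) under a Weibull model."""
    return math.exp(-((t / scale) ** shape))

def exponential_survival(t, rate):
    """Same probability under the simpler exponential model."""
    return math.exp(-rate * t)

# Illustrative parameters only: shape < 1 captures clustering
# (many short gaps between failures, occasional very long ones).
shape, scale = 0.7, 24.0      # hours
rate = 1.0 / 24.0             # exponential with a comparable scale

for t in (1, 24, 72):
    print(t, round(weibull_survival(t, shape, scale), 3),
          round(exponential_survival(t, rate), 3))
```

With shape less than 1, the Weibull model assigns higher probability to short inter-failure gaps than the exponential does, which is one reason it fits clustered failures well despite being harder to manipulate analytically.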
The researchers emphasized that although LogAider was originally designed for studying logs on the Mira computer system, they have sought to make the tool as generic as possible.
“LogAider’s customizability is key,” Di said. “For example, LogAider includes an easy-to-use layout template that users can customize for other multilayer system architectures. And the codes are released as open source on GitHub so that users can modify them to fit their own demands, if necessary.”
For more details about this work, see
S. Di, H. Guo, R. Gupta, E. Pershey, and F. Cappello, “Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System,” IEEE Transactions on Parallel and Distributed Systems, 30(2): 361–374, 2019.