As we approach the exascale era, high-performance computing systems are expected to have billions of processing elements. The complexity arising from so many processors will increase error rates significantly. Among the most dangerous errors are silent data corruptions (SDCs) – unexpected errors that may corrupt results yet cannot be detected by the hardware.
To remedy this situation, researchers from Argonne National Laboratory, Pacific Northwest National Laboratory, and two research organizations in Spain joined forces to create an online framework, called MACORD, that uses machine learning algorithms for corruption detection in high-performance computing applications.
The developers of MACORD faced several significant challenges. First, they needed to determine which algorithms could be applied online. They tackled this challenge by analyzing the training cost of 11 state-of-the-art supervised learning algorithms under different metrics. From these, they selected the five algorithms with the lowest training costs.
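The article does not spell out how that cost analysis was carried out, but the idea can be sketched in a few lines: time each candidate algorithm's training on a sample of the data and keep the cheapest ones. The learners below are toy stand-ins, not the paper's eleven algorithms.

```python
import time

import numpy as np


def select_cheapest(learners, X, y, k=2):
    """Time each candidate learner's training on (X, y) and
    return the names of the k learners with the lowest cost."""
    costs = {}
    for name, fit in learners.items():
        start = time.perf_counter()
        fit(X, y)  # train once on the sample window
        costs[name] = time.perf_counter() - start
    return sorted(costs, key=costs.get)[:k]


# Toy stand-ins for candidate predictors (names are illustrative).
learners = {
    "last_value": lambda X, y: None,                    # no training at all
    "linear_fit": lambda X, y: np.polyfit(X, y, deg=1), # least-squares line
    "quadratic_fit": lambda X, y: np.polyfit(X, y, deg=2),
}

X = np.arange(100.0)
y = 2.0 * X + 1.0
print(select_cheapest(learners, X, y, k=2))  # the two cheapest learners
```

In MACORD, cheap training matters because selection happens repeatedly at runtime, so the training cost is paid on every adaptation step, not just once.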
“Five wasn’t a magic number: any online-applicable algorithm can be directly added to MACORD,” said Sheng Di, an assistant computer scientist in Argonne’s Mathematics and Computer Science (MCS) Division. He cautioned, however, that most algorithms are too costly to be applied online.
The researchers also evaluated different metrics for dynamically selecting the best-fitting algorithms, including root mean square error, mean absolute error, and root mean square error division. The results showed that the choice of metric had no tangible effect in either error-free executions or executions with injected faults.
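The first two metrics are standard measures of how far a predictor's outputs stray from the observed values; as a quick reference, they can be computed as follows (a generic sketch, not code from MACORD):

```python
import numpy as np


def rmse(pred, actual):
    """Root mean square error between predicted and observed values."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(actual)) ** 2)))


def mae(pred, actual):
    """Mean absolute error between predicted and observed values."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(actual))))


pred = [1.0, 2.0, 3.0]
actual = [1.0, 2.0, 5.0]
print(rmse(pred, actual))  # sqrt((0 + 0 + 4) / 3) ≈ 1.1547
print(mae(pred, actual))   # (0 + 0 + 2) / 3 ≈ 0.6667
```

RMSE penalizes large deviations more heavily than MAE, which is one reason such metrics could, in principle, rank candidate algorithms differently.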
Another challenge facing the researchers was the error distribution. Because no information is available about how silent errors will manifest themselves, the researchers tested four error distributions to cover reasonable scenarios that may occur in future high-performance computing systems. Their experiments showed that the learning algorithms do indeed perform differently across the different distributions.
Armed with this information, the research team performed a series of experiments on the MareNostrum supercomputer at the Barcelona Supercomputing Center in Spain, using scientific applications involving hydrodynamics, computational fluid dynamics, and heat distribution. Compared with two state-of-the-art SDC detectors, the novel MACORD framework achieved better detection sensitivity (up to 99%) with a false positive rate of only 0.1% in most cases.
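Sensitivity and false positive rate come from the usual confusion counts; the quoted figures correspond to ratios like the ones below (the counts are made up for illustration, not taken from the paper):

```python
def detection_rates(true_pos, false_neg, false_pos, true_neg):
    """Sensitivity (recall) and false positive rate from confusion counts."""
    sensitivity = true_pos / (true_pos + false_neg)   # fraction of errors caught
    fpr = false_pos / (false_pos + true_neg)          # fraction of false alarms
    return sensitivity, fpr


# E.g., 99 of 100 injected errors caught, 1 false alarm over 1000 clean steps.
sens, fpr = detection_rates(true_pos=99, false_neg=1, false_pos=1, true_neg=999)
print(sens, fpr)  # 0.99 0.001
```

Keeping the false positive rate low matters as much as sensitivity: every false alarm can trigger an unnecessary recovery action, such as a rollback to a checkpoint.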
“Of course, while accuracy and prediction capability are critical, so too is memory overhead,” said Prasanna Balaprakash, a computer scientist in the MCS Division. “We decided against using temporal techniques because they typically maintain several data values for each data point, incurring enormous overhead. Instead, we adopted a spatial technique that relies only on the snapshot data at the current time step. As a result, MACORD incurs less than 1% memory overhead.”
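A minimal sketch of such a spatial, single-snapshot detector: predict each interior point from its neighbors in the current snapshot and flag large residuals. The neighbor-average predictor and the threshold rule here are illustrative assumptions, not the paper's exact method.

```python
import numpy as np


def spatial_outliers(snapshot, threshold=5.0):
    """Flag interior points that deviate strongly from the average of
    their two spatial neighbors in the current snapshot.  Because only
    the current time step is inspected, no per-point history needs to
    be kept (a simplified sketch, not MACORD's exact predictor)."""
    s = np.asarray(snapshot, dtype=float)
    pred = 0.5 * (s[:-2] + s[2:])      # neighbor-average prediction
    resid = np.abs(s[1:-1] - pred)     # prediction residuals
    scale = resid.mean() + 1e-12       # illustrative residual scale
    return np.flatnonzero(resid > threshold * scale) + 1


field = np.sin(np.linspace(0, 3, 200))  # a smooth "physical" field
field[120] += 5.0                       # inject a silent corruption
print(spatial_outliers(field))          # the corrupted point (and possibly
                                        # its immediate neighbors) stand out
```

The memory argument is visible in the sketch: a temporal detector would store several past values per grid point, while this spatial check needs only the current array plus a few temporaries.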
The researchers also emphasized MACORD’s adaptivity. Unlike their previous work, which used a fixed learning algorithm, MACORD is an adaptive framework that automatically selects the best-fit algorithms at runtime in response to the data dynamics. Because of this adaptive design, MACORD’s detection ability improves significantly.
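Runtime selection of a best-fit algorithm can be sketched as scoring each candidate predictor on recent history and picking the one with the lowest error. The candidate predictors and scoring rule below are hypothetical simplifications, not MACORD's actual scheme.

```python
def best_fit_predictor(history, candidates):
    """Score each candidate one-step predictor on the recent history and
    return the name of the one with the lowest mean absolute error."""
    def score(predict):
        errs = [abs(predict(history[:t]) - history[t])
                for t in range(2, len(history))]
        return sum(errs) / len(errs)
    return min(candidates, key=lambda name: score(candidates[name]))


# Illustrative candidates, each mapping a history to a one-step forecast.
candidates = {
    "last_value": lambda h: h[-1],            # persistence forecast
    "linear": lambda h: 2 * h[-1] - h[-2],    # linear extrapolation
}

ramp = [0.5 * i for i in range(20)]           # steadily rising signal
print(best_fit_predictor(ramp, candidates))   # prints "linear"
```

A flat signal with an abrupt jump would instead favor `last_value`, which is exactly the point of adapting: no single predictor dominates across all data dynamics.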
“To the best of our knowledge, this is the first learning-based scheme leveraging different supervised learning algorithms to find the best-fit algorithm in detecting SDCs for HPC applications,” said Di.
For further information, see the full paper: “MACORD: Online Adaptive Machine Learning Framework for Silent Error Detection,” by O. Subasi, S. Di, P. Balaprakash, O. Unsal, J. Labarta, A. Cristal, S. Krishnamoorthy, and F. Cappello, in Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER), Sept. 2017. http://ieeexplore.ieee.org/document/8049008/.