Detecting Silent Data Corruption through Time Series Prediction in HPC Applications
As the HPC community moves towards exascale, new resilience challenges will arise. For instance, future hardware might be unable to detect some bit-flips corrupting the application data, producing what is called Silent Data Corruption (SDC). These errors can cause significant changes in applications' output at the end of the simulation, making their detection and correction a pressing problem.
Fortunately, we have observed that for most HPC applications, data values do not change drastically from one time step to the next. This property makes it possible to use one-step-ahead prediction methods to detect possible data corruption by flagging observed values that deviate 'far enough' from their predictions. In this talk, I will present an SDC detection method and show its effectiveness in detecting bit-flips for HACC, Nek5k and a turbulence CFD kernel. We will also discuss some of the current limitations, as well as future directions for this work.
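To make the idea concrete, here is a minimal sketch of one-step-ahead detection on a single data point tracked over time. It is not the method presented in the talk: the predictor (linear extrapolation from the last two values), the fixed tolerance, and the names flip_bit, predict_next and detect are all assumptions chosen for illustration.

```python
import struct


def flip_bit(value, bit):
    """Flip one bit (0 = lowest mantissa bit, 63 = sign) of a 64-bit float."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def predict_next(history):
    """Assumed predictor: linear extrapolation from the last two values."""
    if len(history) < 2:
        return history[-1]
    return 2.0 * history[-1] - history[-2]


def detect(series, tolerance):
    """Return the time steps whose value is too far from its prediction."""
    history = [series[0]]  # the first value is trusted, nothing to compare to
    flagged = []
    for t in range(1, len(series)):
        predicted = predict_next(history)
        observed = series[t]
        # "not <=" also flags NaN, which a plain ">" comparison would miss.
        if not abs(observed - predicted) <= tolerance:
            flagged.append(t)
            # Keep the prediction so the suspect value does not
            # contaminate later predictions.
            history.append(predicted)
        else:
            history.append(observed)
    return flagged


if __name__ == "__main__":
    # A smoothly varying quantity, standing in for application data.
    values = [1.0 + 0.01 * t for t in range(20)]
    corrupted = list(values)
    corrupted[10] = flip_bit(values[10], 61)  # flip a high exponent bit
    print(detect(corrupted, tolerance=0.05))
```

In this sketch a flip in a high exponent bit changes the value by orders of magnitude and is flagged, while a flip in a low mantissa bit stays within the tolerance and passes unnoticed. The tolerance therefore trades missed corruptions against false alarms.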
Eduardo Berrocal is a Research Assistant and Ph.D. Candidate in the Scalable Computing Software Laboratory (SCS) at the Illinois Institute of Technology (IIT) in Chicago. He received his BS and MS in Computer Science from the Polytechnic University of Madrid (Spain) in 2008, and his MS in Computer Science from IIT in 2009. His current research focuses on Data Analytics for High Performance Computing. He is also part of the CUDA Teaching Center at IIT, where he gives lectures and conducts workshops on GPU programming.