Argonne National Laboratory

Upcoming Events

Probabilistic Fault Detection and Diagnosis in Large-Scale HPC Applications

Ignacio Laguna, School of Electrical and Computer Engineering, Purdue University
October 10, 2012 10:30AM to 11:30AM
Building 240, Room 1406-1407
Debugging large-scale HPC applications is challenging. Faults can come from hardware malfunctions, software bugs, or unexpected runtime conditions. In addition, some faults manifest only at large scale, when the application is executed with a large number of processes or with a large input data set. Most existing debugging tools scale poorly and, more importantly, do not automate the process of finding the origin of failures; developers have to manually inspect the state of a large number of processes to find the root cause of a problem.

This talk will present a probabilistic technique to detect and diagnose faults in large-scale parallel applications. Ignacio will present a methodology to model historic control-flow and timing information of MPI tasks using a semi-Markov model. When a failure occurs, his technique determines the faulty task(s) and code region(s) where the problem first manifests. The technique isolates abnormal tasks and code regions by clustering the behavioral models of the MPI tasks and then finding 'outliers' within task clusters. He has implemented this technique in a tool called AutomaDeD and evaluated it against fault injections in the Sequoia and the NAS Parallel Benchmarks; AutomaDeD is able to identify the origin of faults 85% of the time. He will also show how AutomaDeD isolates, in a few seconds, the origin of a difficult-to-catch bug in a large-scale molecular dynamics simulation code. The scalability of his technique has been demonstrated with over 32,000 MPI tasks on a BlueGene/P system.
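
To make the idea concrete, the following is a minimal Python sketch of the general approach described above. It is an illustration under stated assumptions, not AutomaDeD's implementation. Each task's trace is assumed to be a list of (code region, seconds spent) pairs, the "semi-Markov model" is reduced to transition probabilities plus mean time per region, and a median pairwise distance stands in for the clustering step. All names (build_model, dissimilarity, find_outlier) and the dissimilarity formula are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def build_model(trace):
    """Build a toy semi-Markov model from a list of (region, seconds) pairs.

    Returns (probs, sojourn): probs maps (src, dst) -> transition probability,
    sojourn maps region -> mean time spent in that region.
    """
    counts = defaultdict(int)
    out_totals = defaultdict(int)
    durations = defaultdict(list)
    # Count transitions between consecutive regions in the trace.
    for (src, _), (dst, _) in zip(trace, trace[1:]):
        counts[(src, dst)] += 1
        out_totals[src] += 1
    # Collect the time spent in each region.
    for region, secs in trace:
        durations[region].append(secs)
    probs = {edge: n / out_totals[edge[0]] for edge, n in counts.items()}
    sojourn = {region: mean(times) for region, times in durations.items()}
    return probs, sojourn


def dissimilarity(model_a, model_b):
    """Hypothetical distance: transition-probability gap plus relative timing gap."""
    probs_a, time_a = model_a
    probs_b, time_b = model_b
    edge_diff = sum(abs(probs_a.get(e, 0.0) - probs_b.get(e, 0.0))
                    for e in set(probs_a) | set(probs_b))
    time_diff = 0.0
    for region in set(time_a) | set(time_b):
        a, b = time_a.get(region, 0.0), time_b.get(region, 0.0)
        if max(a, b) > 0:
            time_diff += abs(a - b) / max(a, b)
    return edge_diff + time_diff


def find_outlier(traces):
    """Return (task_id, scores); the outlier has the largest median distance to its peers."""
    models = {task: build_model(trace) for task, trace in traces.items()}
    scores = {}
    for task, model in models.items():
        distances = [dissimilarity(model, other)
                     for peer, other in models.items() if peer != task]
        scores[task] = median(distances)
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Made-up traces: tasks 0-2 behave alike, task 3 hangs in "recv".
    normal = [("compute", 1.0), ("send", 0.1), ("recv", 0.1),
              ("compute", 1.0), ("send", 0.1), ("recv", 0.1)]
    stuck = [("compute", 1.0), ("send", 0.1),
             ("recv", 5.0), ("recv", 5.0), ("recv", 5.0)]
    suspect, _ = find_outlier({0: normal, 1: normal, 2: normal, 3: stuck})
    print("suspect task:", suspect)
```

In this toy setup, the three well-behaved tasks have identical models, so the task stuck in "recv" is the only one whose model differs from the majority. A real tool would need a scalable clustering step and a way to rank individual code regions, both of which this sketch omits.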

Ignacio Laguna is a PhD candidate in the School of Electrical and Computer Engineering at Purdue University, working under the supervision of Professor Saurabh Bagchi. His research interests include fault detection and diagnosis in large-scale distributed applications and machine learning for anomaly detection. He received the ACM & IEEE George Michael Memorial HPC Fellowship in 2011; this award honors exceptional PhD students throughout the world whose research focuses on HPC.