On Determining a Viable Path to Resilience at Exascale
Exascale computing is projected to feature billion-core parallelism. At such large processor counts, faults will become commonplace. Current techniques to tolerate faults focus on reactive schemes for recovery and generally rely on a simple checkpoint/restart mechanism.
Yet, they have a number of shortcomings:
- (1) They do not scale and require complete job restarts.
- (2) Projections indicate that the mean-time-between-failures is approaching the overhead required for checkpointing.
- (3) Existing approaches are application-centric, which increases the burden on application programmers and reduces portability.
To address these problems, we discuss a number of techniques, and their level of maturity (or lack thereof):
- (a) scalable network overlays,
- (b) on-the-fly process recovery,
- (c) proactive process-level fault tolerance,
- (d) redundant execution,
- (e) the effect of silent data corruptions (SDCs) on IEEE floating-point arithmetic, and
- (f) resilience modeling.
In combination, these methods aim to pave the path to exascale computing.
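To give a flavor of item (e), the impact of a silent data corruption on an IEEE 754 double depends heavily on which bit flips: a low mantissa bit perturbs the value negligibly, while an exponent bit can change it catastrophically. The following sketch is an illustration only (not code from the talk), modeling a single-bit SDC in Python:

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Model a silent data corruption by flipping one bit
    of the 64-bit IEEE 754 representation of x."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    bits ^= 1 << bit
    (y,) = struct.unpack("<d", struct.pack("<Q", bits))
    return y

# Flipping the lowest mantissa bit of 1.0 yields the next
# representable double, a relative error of about 2**-52.
print(flip_bit(1.0, 0))   # 1.0000000000000002

# Flipping the highest exponent bit of 1.0 produces infinity.
print(flip_bit(1.0, 62))  # inf
```

This asymmetry is one reason SDC detection schemes cannot treat all bit positions as equally harmful.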
Frank Mueller is a Professor in Computer Science and a member of multiple research centers at North Carolina State University. Previously, he held positions at Lawrence Livermore National Laboratory and Humboldt University Berlin, Germany. He received his Ph.D. from Florida State University in 1994. He has published papers in the areas of parallel and distributed systems, embedded and real-time systems, and compilers.
He is a member of ACM SIGPLAN, ACM SIGBED and a senior member of the ACM and IEEE Computer Societies as well as an ACM Distinguished Scientist. He is a recipient of an NSF Career Award, an IBM Faculty Award, a Google Research Award and a Fellowship from the Humboldt Foundation.