Abstract: As high-performance computing (HPC) applications have become mainstream in both industry and academia, it has been frequently acknowledged that building and programming such systems will become more challenging as they grow more complex and heterogeneous. The inherent complexity of these applications raises the question of how to suitably specify correctness requirements, given the wide array of ways in which inaccuracies can be introduced.
Additionally, HPC systems are primarily data driven; hence, a significant amount of time is spent in either extensive numerical computation or data movement. At such massive scale, the round-off error introduced by every floating-point operation becomes a critical component of correctness specification. While a slew of research has been directed toward obtaining rigorous round-off error bounds, the resulting techniques have not scaled beyond a few dozen operators, and so cannot be applied effectively to even the smaller kernels of HPC subblocks. Furthermore, the adoption of system resilience solutions has been severely hampered by performance impacts and high false positive rates, which make it even harder to identify error sources judiciously.
In this talk, we discuss a scalable yet rigorous technique for analyzing floating-point applications, and we show how this technique can be made effective for synthesizing error detection strategies. In particular, we point out how soft-error detection methods can help guard against incorrect polyhedral compilations, which may also create aberrant values. Our methodology improves the current state of the art by four orders of magnitude.
Next, we introduce a novel synthesis technique that enables analysis of codes with branch conditions. This extends our rigorous analysis technique to the conditional codes prevalent in many geometric libraries and control software.
Last, we show how such analysis techniques can be geared toward application-specific error detector synthesis. To this end, we exploit the floating-point behavior of applications (in our case, stencils) to efficiently synthesize detectors that are optimally placed for detecting logical and soft errors with robust precision guarantees.
Bio: Ganesh Gopalakrishnan is Director of the Center for Parallel Computing and Professor of Computer Science at the University of Utah. Arnab Das is a PhD Candidate at the University of Utah.