Fault tolerance is a cross-cutting issue that spans all layers of the hardware/software stack. FTS 2015, held in Chicago on Sept. 8 in conjunction with IEEE Cluster 2015, aimed to provide a venue for researchers to share experiences across the different layers and to gain a holistic view of fault tolerance techniques.
Despite the title of the workshop, Cappello began by stating, “Let’s forget about ‘fault tolerance’ for high-performance computing.” He argued that fault tolerance has been dealing with issues such as process crashes, with the objective of protecting HPC executions. “What matters,” he maintained, “is also to protect the correctness of the results that such executions produce.”
Cappello, a senior scientist in Argonne’s Mathematics and Computer Science Division, emphasized that the new challenge in reliability is trust. He reviewed the types of disruptions leading to corruption of application results, the ways that users build trust in such results, and the limitations of current techniques — including fault tolerance, validation and verification, and uncertainty quantification.
On a more positive note, he also presented two approaches to improving result trustworthiness. The first, the “external algorithmic observer” approach, involves executing a model of the data transformation performed by the application. The second approach, based on “trust relations,” requires establishing trust individually for each hardware and software component involved in the execution.
For more information about Cappello’s keynote address and the other presentations, see the workshop website.