A Double Bit Error Detection and Recovery Library for Accelerators

Xiuxia Zhang, Chinese Academy of Sciences, Institute of Computing Technology
May 30, 2014 1:30PM to 2:30PM
Building 240, Room 4301
Reliability is crucial to high-performance computer systems, and it becomes more important as systems grow larger and more complex. Much work has gone into building more reliable computer systems. However, accelerators still lack automatic soft error detection and handling features. For example, even the latest GPUs do not include double-bit error correction mechanisms. In this talk, we present a new solution designed and developed to address these issues. We built a fault detection and automatic recovery system into VOCL, a virtualization layer for OpenCL.

This layer performs automatic detection and correction of double-bit ECC errors. We also provide optimizations for our double-bit error detection scheme. We evaluate our solution on the HokieSpeed cluster at Virginia Tech. Results show that the overhead of our error detection and data protection schemes is comparable to that of previous work. With our optimizations applied, the system avoids much of the unnecessary data replication during checkpointing, while providing automatic recovery mechanisms that were previously unavailable.

Xiuxia Zhang is a second-year Ph.D. candidate at the Chinese Academy of Sciences, Institute of Computing Technology, working in Mingyu Chen's group. She has been working at Argonne with Pavan Balaji's group for 12 months. Her research interests include application tuning on accelerators, parallel algorithms, and compiler techniques.