Zhao, Kai; Di, Sheng; Li, Sihuan; Liang, Xin; Zhai, Yujia; Chen, Jieyang; Ouyang, Kaiming; Cappello, Franck; Chen, Zizhong
Convolutional neural networks (CNNs) are becoming increasingly important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage. Ensuring the stability of the CNN inference process against soft errors is therefore of critical importance. Traditional fault tolerance methods are not suitable for CNN inference because error-correcting code is unable to protect computational components, instruction duplication techniques incur high overhead, and existing algorithm-based fault tolerance (ABFT) techniques cannot protect all convolution implementations. In this paper, we focus on protecting the CNN inference process against soft errors as efficiently as possible, with the following three contributions. (1) We propose several systematic ABFT schemes based on checksum techniques and thoroughly analyze their fault protection ability and runtime. Unlike traditional ABFT based on matrix-matrix multiplication, our schemes support any convolution implementation. (2) We design a novel workflow integrating all the proposed schemes to obtain high detection/correction ability with limited total runtime overhead. (3) We perform our evaluation using ImageNet with well-known CNN models including AlexNet, VGG-19, ResNet-18, and YOLOv2. Experimental results demonstrate that our implementation can handle soft errors with very limited runtime overhead (4% to 8% in both error-free and error-injected situations).
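The checksum idea the abstract refers to rests on the linearity of convolution: convolving an input with the elementwise sum of several filters must equal the sum of the per-filter outputs, so a mismatch signals a soft error. The following is a minimal NumPy sketch of that property, not the paper's actual scheme; the layer shapes, filter count, and naive `conv2d` helper are illustrative assumptions.

```python
import numpy as np

def conv2d(x, k):
    # naive "valid" 2-D convolution for one channel (illustration only;
    # any convolution implementation would preserve the checksum property)
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))          # hypothetical input feature map
filters = rng.standard_normal((4, 3, 3))  # hypothetical filter bank

# per-filter outputs, as a convolutional layer would compute them
outputs = np.stack([conv2d(x, f) for f in filters])

# checksum filter = elementwise sum of all filters; by linearity,
# conv(x, sum of filters) == sum over filters of conv(x, filter)
checksum_out = conv2d(x, filters.sum(axis=0))

# error-free case: the two checksums agree
assert np.allclose(outputs.sum(axis=0), checksum_out)

# inject a soft error into one output element: the checksums now disagree,
# so the corruption is detected
outputs[2, 1, 1] += 5.0
assert not np.allclose(outputs.sum(axis=0), checksum_out)
```

Because the check uses only one extra convolution (with the checksum filter) plus a summation, its cost is a small fraction of the layer's work, which is what keeps the overall overhead low.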