Understanding Asynchronous Noise: Towards Exascale Resilience
Understanding the sources and impact of computational noise on application performance is a growing concern for the HPC community. It is anticipated that next-generation HPC architectures will be characterized by inherent load imbalances arising from a broad range of noise sources; these sources are discussed at length by Brown et al., Snir et al., and the references therein. The net effect, however, is well understood: for applications, equal node work will not in general equate to equal execution time, and bulk-synchronous algorithmic formulations will therefore suffer severe performance degradation.
These effects are clearly undesirable, and mitigating them requires a deeper understanding of how noise interacts with algorithms. A simple model of noise with an adjustable level of asynchrony is presented. The model is used to generate synthetic noise traces, which are injected into a representative nearest-neighbor, bulk-synchronous time-stepping algorithm. The resulting performance of the algorithm is measured and compared to its performance in the presence of Gaussian-distributed noise. The results empirically illustrate that asynchrony is a dominant mechanism by which many types of computational noise degrade the performance of bulk-synchronous algorithms, whether their macroscopic noise distributions are constant or random.
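The core mechanism can be illustrated with a toy simulation (a minimal sketch, not the model used in this work; the process count, step count, and half-normal delay parameters are illustrative assumptions). Each bulk-synchronous step costs the maximum over all processes of the base work plus a noise delay, since the barrier makes the slowest rank set the pace. Comparing independent per-process delays against a single shared delay per step isolates asynchrony: both cases have the same macroscopic noise distribution, but only the asynchronous case exposes the tail of the maximum.

```python
import random

def bsp_runtime(num_procs, num_steps, asynchronous, seed=0):
    """Toy bulk-synchronous model: each step costs the max over
    processes of (base work + noise delay).  All parameters are
    illustrative, not taken from the noise model in the text."""
    rng = random.Random(seed)
    base, total = 1.0, 0.0
    for _ in range(num_steps):
        if asynchronous:
            # Each process draws its own delay: the barrier's max
            # picks up the tail of the distribution every step.
            delays = [abs(rng.gauss(0.0, 0.1)) for _ in range(num_procs)]
        else:
            # All processes share one draw per step: identical
            # macroscopic distribution, but zero asynchrony.
            d = abs(rng.gauss(0.0, 0.1))
            delays = [d] * num_procs
        total += base + max(delays)
    return total

t_async = bsp_runtime(num_procs=256, num_steps=1000, asynchronous=True)
t_sync = bsp_runtime(num_procs=256, num_steps=1000, asynchronous=False)
print(t_async > t_sync)  # asynchrony inflates step time via the max
```

Under this sketch, the synchronized run pays roughly the mean delay per step, while the asynchronous run pays roughly the expected maximum of many draws per step, which grows with process count, consistent with the claim that asynchrony, not the distribution shape, dominates the degradation.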