Abstract: Recent years have seen a fusion of deep learning (DL) and high-performance computing. Domain scientists are exploring and exploiting DL techniques for classification, prediction, and dimensionality reduction of simulation data. These DL applications are naturally supercomputing applications given their computation, communication, and I/O characteristics.
In this talk, I will present two works that enable highly scalable distributed DL training. The first focuses on the layer-wise adaptive rate scaling (LARS) algorithm and its application to ImageNet training on thousands of compute nodes with state-of-the-art validation accuracy. The second enables efficient and scalable I/O for DL applications on supercomputers via FanStore, with which we are able to scale real-world applications to hundreds of nodes on CPU and GPU clusters with over 90% scaling efficiency.
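The core idea of layer-wise adaptive rate scaling is to give each layer its own effective learning rate, scaled by the ratio of the layer's weight norm to its gradient norm, so that very large batch sizes remain stable. A minimal sketch of one such update for a single layer is below; the function name, hyperparameter values, and NumPy formulation are illustrative assumptions, not the talk's actual implementation.

```python
import numpy as np

def lars_update(w, grad, lr=0.1, trust_coef=0.001, weight_decay=5e-4):
    """One illustrative LARS step for a single layer's weights.

    The trust ratio ||w|| / ||grad + wd*w|| rescales the global
    learning rate per layer, keeping updates proportionate to the
    magnitude of the weights themselves.
    """
    g = grad + weight_decay * w          # gradient with weight decay
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(g)
    # Layer-wise trust ratio; fall back to 1.0 if norms are degenerate.
    local_lr = trust_coef * w_norm / g_norm if g_norm > 0 and w_norm > 0 else 1.0
    return w - lr * local_lr * g
```

In distributed training, each worker would apply this per-layer scaling after gradient aggregation, which is what allows the batch size (and hence node count) to grow without losing accuracy.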
Bio: Zhao Zhang is a computational scientist at the Texas Advanced Computing Center. His current research focuses on scalable deep learning on supercomputers. He received his Ph.D. in computer science from the University of Chicago.