Publication

A Zoom-in Analysis of I/O Logs to Detect Root Causes of I/O Performance Bottlenecks

Authors

Wang, Teng; Byna, Suren; Lockwood, Glenn; Snyder, Shane; Carns, Philip; Kim, Sunggon; Wright, Nicholas

Abstract

Scientific applications frequently spend a large fraction of their execution time reading and writing data on parallel file systems. Identifying these I/O performance bottlenecks and attributing root causes are critical steps toward devising optimization strategies. Several existing studies analyze I/O logs of a set of benchmarks or applications that were run with controlled behaviors. However, there is still a lack of a general approach that systematically identifies I/O performance bottlenecks for applications running “in the wild” on production systems. In this study, we have developed an analysis approach of “zooming in” from platform-wide to application-wide to job-level I/O logs for identifying I/O bottlenecks in arbitrary scientific applications. We analyze the logs collected on a Cray XC40 system in production over a two-month period. This study results in several insights for application developers to use in optimizing I/O behavior.