VeloC: Very Low Overhead transparent multilevel Checkpoint/restart
Currently, the majority of high-performance computing applications use ad hoc checkpoint/restart for fault tolerance. However, the addition of nonvolatile memory and burst buffers to systems will require nontrivial code modifications to handle the diversity of architectures and to realize the benefits of these new levels of storage. Dozens of complex applications will need significant modifications. Moreover, without specific optimizations for fast restart from checkpoints stored on nonvolatile memory, applications may experience long delays while retrieving the checkpoint from the file system.
The VeloC project addresses these challenges by providing a framework that offers a simple checkpoint/restart interface and transparently delivers the benefits of multilevel checkpointing to exascale applications. To attain this objective, the research and development effort focuses on refactoring two checkpoint libraries, Fault Tolerance Interface (FTI) and Scalable Checkpoint/Restart (SCR), into the VeloC framework.
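To make the idea of multilevel checkpointing concrete, the following is a minimal sketch of the general technique in Python. It is not the VeloC, FTI, or SCR API: the class `MultilevelCheckpointer`, its methods, the file naming scheme, and the two directories standing in for a fast node-local tier and a slower persistent tier are all hypothetical and chosen only for illustration. The sketch writes each checkpoint to the fast tier, copies selected checkpoints to the slow tier, and on restart picks the newest version available, preferring the fast tier when both hold it.

```python
import os
import shutil
import tempfile


class MultilevelCheckpointer:
    """Toy two-level checkpointer (hypothetical, not the VeloC API)."""

    def __init__(self, fast_dir, slow_dir):
        # fast_dir stands in for node-local storage (e.g. nonvolatile memory);
        # slow_dir stands in for a persistent parallel file system.
        self.fast_dir = fast_dir
        self.slow_dir = slow_dir
        os.makedirs(fast_dir, exist_ok=True)
        os.makedirs(slow_dir, exist_ok=True)

    @staticmethod
    def _name(version):
        return f"ckpt_{version:06d}.bin"

    def checkpoint(self, version, data):
        """Write a checkpoint to the fast tier only."""
        path = os.path.join(self.fast_dir, self._name(version))
        with open(path, "wb") as f:
            f.write(data)
        return path

    def flush(self, version):
        """Copy a checkpoint to the slow tier.

        A real framework would typically do this in the background;
        here it is a plain synchronous copy.
        """
        src = os.path.join(self.fast_dir, self._name(version))
        dst = os.path.join(self.slow_dir, self._name(version))
        shutil.copyfile(src, dst)
        return dst

    def restart(self):
        """Return (version, data, level) for the newest checkpoint, or None."""
        best = None
        # The fast tier is scanned first, so it wins when versions tie.
        for level, directory in (("fast", self.fast_dir), ("slow", self.slow_dir)):
            for name in os.listdir(directory):
                if name.startswith("ckpt_") and name.endswith(".bin"):
                    version = int(name[5:-4])
                    if best is None or version > best[0]:
                        best = (version, os.path.join(directory, name), level)
        if best is None:
            return None
        version, path, level = best
        with open(path, "rb") as f:
            return version, f.read(), level


def demo():
    """Checkpoint twice, lose the fast tier, and restart from the slow tier."""
    root = tempfile.mkdtemp()
    fast = os.path.join(root, "fast")
    slow = os.path.join(root, "slow")
    ckpt = MultilevelCheckpointer(fast, slow)

    ckpt.checkpoint(1, b"state-1")
    ckpt.flush(1)                      # version 1 is now on both tiers
    ckpt.checkpoint(2, b"state-2")     # version 2 exists on the fast tier only
    before_failure = ckpt.restart()

    # Simulate a node failure that wipes the fast tier.
    shutil.rmtree(fast)
    os.makedirs(fast)
    after_failure = ckpt.restart()

    shutil.rmtree(root)
    return before_failure, after_failure
```

In this sketch, a restart after losing the fast tier falls back to the last flushed version, which illustrates the trade-off multilevel schemes manage: frequent cheap checkpoints on fast storage, and less frequent but more resilient copies on slower storage.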