Argonne National Laboratory

LigHTS: The Case for Limping-Hardware Tolerant Systems

Haryadi Gunawi, University of Chicago
July 22, 2013 10:30AM to 11:30AM
Building 240, Room 4301

With the advent of scalable parallel computing, thousands of devices are connected and managed collectively. This era is confronted with a new challenge: performance failure. In this talk, we highlight one overlooked cause of performance failure: "limpware" -- hardware whose performance degrades ("limps") significantly compared to its specification.

In our preliminary work, we measured the system-level impact of limpware on five scale-out systems (Hadoop, HDFS, ZooKeeper, Cassandra, and HBase) and found that limpware can severely impact distributed operations, nodes, and an entire cluster. From the results, we introduce the concept of "limplock", a situation where a system is "locked" in a limping mode due to the presence of limpware and is not capable of failing over to healthy components. We show how each system that we analyzed can exhibit operation, node, and cluster limplock. We conclude that many scale-out systems are not limpware tolerant.

To address this issue, I will describe our LigHTS project, specifically our three major objectives. First, we plan to study limpware characteristics via log analysis and instrumentation. Second, we will continue to analyze the system-level impact of limpware in order to unearth design flaws and reevaluate how future systems should evolve. Finally, we will establish limpware-tolerant principles and apply them in building prototypes of cross-layer LigHTS spanning distributed storage, computing frameworks, and operating and runtime systems.

Haryadi Gunawi is a Neubauer Family Assistant Professor in the Department of Computer Science at the University of Chicago, where he leads the UCARE research group. He received his Ph.D. in Computer Science from the University of Wisconsin, Madison in 2009. He was a postdoctoral fellow at the University of California, Berkeley from 2010 to 2012. His research focuses on the reliability of storage systems. He has received numerous awards, including an Honorable Mention for the 2009 ACM Doctoral Dissertation Award, the 2009 departmental best thesis award (as co-winner), and the 2010 NSF Computing Innovation Fellowship.