Cheriere, Nathanael; Dorier, Matthieu; Antoniu, Gabriel
Efficient resource utilization is becoming a major concern as large-scale distributed computing infrastructures keep growing in size. Malleability, the ability of resource managers to dynamically increase or decrease the amount of resources allocated to a job, is a promising way to save energy and costs. However, state-of-the-art parallel and distributed storage systems have not been designed with malleability in mind, mainly because of the supposedly high cost of the data transfers that resizing operations require. Nevertheless, as network and storage technologies evolve, old assumptions about potential bottlenecks can be revisited. In this study, we evaluate the viability of malleability as a design principle for a distributed storage system. Specifically, we model the minimal duration of the commission and decommission operations. To show how our models can be used in practice, we evaluate the performance of these operations in HDFS, a relevant state-of-the-art distributed file system. We show that the existing decommission mechanism of HDFS performs well when the network is the bottleneck, but can be accelerated by up to a factor of 3 when storage is the limiting factor. We also show that the commission operation in HDFS can be substantially accelerated. Using the insights provided by our models, we suggest improvements to speed up both operations in HDFS. We discuss how the proposed models can be generalized to distributed file systems with different assumptions, and what perspectives they open for the design of efficient malleable distributed file systems.