Description
In order to ensure data durability and service reliability, data needs to be stored redundantly. While the traditional approach towards this objective is to store multiple replicas of the data, today's unprecedented data growth rates mandate more efficient alternatives. Coding theory, and erasure coding in particular, offers a compelling alternative by making optimal use of the storage space. For this reason, many data-center scale distributed storage systems are beginning to deploy erasure coding instead of replication. This paradigm shift has opened up exciting new challenges and opportunities both on the theoretical as well as the system design fronts. Broadly, this thesis addresses some of these challenges and opportunities by contributing in the following two areas:
(1) Resource-efficient distributed storage codes and systems: Although traditional erasure codes optimize the usage of storage space, they result in a significant increase in the consumption of other important cluster resources such as the network bandwidth, input-output operations on the storage devices (I/O), and computing resources (CPU). This thesis considers the problem of constructing codes, and designing and building storage systems, that reduce the usage of I/O, network, and CPU resources while not compromising on storage efficiency.
(2) New avenues for erasure coding in big-data systems: In big-data systems, the usage of erasure codes has largely been limited to disk-based storage systems, and furthermore, primarily towards achieving space-efficient fault tolerance---in other words, to durably store "cold" (less-frequently accessed) data. This thesis takes a step forward in exploring new avenues for erasure coding---in particular for "hot" (more-frequently accessed) data---by showing how erasure coding can be employed to improve load balancing, and to reduce the (median and tail) latencies in data-intensive cluster caches.
An overarching goal of this thesis is to bridge theory and practice. Towards this goal, we present new code constructions and techniques that possess attractive theoretical guarantees. We also design and build systems that employ the proposed codes and techniques. These systems exhibit significant benefits over the state-of-the-art in evaluations that we perform in real-world settings, and are also slated to be a part of the next release of Apache Hadoop.