Erasure Coding for Big-data Systems: Theory and Practice

Vinayak, Rashmi

Description

Big-data systems enable the storage and analysis of massive amounts of data, and are fueling a data revolution that affects almost every area of human endeavor today. The foundation of any big-data system is a large-scale, distributed data storage system. These storage systems are typically built out of inexpensive and unreliable commodity components, which, in conjunction with numerous other operational glitches, make unavailability events the norm rather than the exception.

To ensure data durability and service reliability, data needs to be stored redundantly. While the traditional approach to this objective is to store multiple replicas of the data, today's unprecedented data growth rates mandate more efficient alternatives. Coding theory, and erasure coding in particular, offers a compelling alternative by making optimal use of the storage space. For this reason, many data-center-scale distributed storage systems are beginning to deploy erasure coding instead of replication. This paradigm shift has opened up exciting new challenges and opportunities on both the theoretical and the system-design fronts. Broadly, this thesis addresses some of these challenges and opportunities by contributing in the following two areas:

(1) Resource-efficient distributed storage codes and systems: Although traditional erasure codes optimize the usage of storage space, they result in a significant increase in the consumption of other important cluster resources such as the network bandwidth, input-output operations on the storage devices (I/O), and computing resources (CPU). This thesis considers the problem of constructing codes, and designing and building storage systems, that reduce the usage of I/O, network, and CPU resources while not compromising on storage efficiency.

(2) New avenues for erasure coding in big-data systems: In big-data systems, the usage of erasure codes has largely been limited to disk-based storage systems, and primarily to achieving space-efficient fault tolerance, that is, to durably storing "cold" (less-frequently accessed) data. This thesis takes a step forward in exploring new avenues for erasure coding, in particular for "hot" (more-frequently accessed) data, by showing how erasure coding can be employed to improve load balancing and to reduce the (median and tail) latencies in data-intensive cluster caches.
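
The trade-off described in area (1) can be illustrated with a small back-of-the-envelope calculation. The sketch below is not taken from the thesis: the parameters (3x replication and a maximum-distance-separable code with 10 data and 4 parity blocks) and all names in the code are assumptions chosen for illustration. It relies only on the standard property that a conventional MDS code with k data blocks reads k surviving blocks to rebuild one lost block, whereas replication reads a single surviving copy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    """Illustrative model of a redundancy scheme (hypothetical names)."""
    name: str
    data_blocks: int        # blocks of original data per stripe
    redundant_blocks: int   # extra blocks stored for fault tolerance
    repair_reads: int       # blocks read to rebuild one lost block

    @property
    def storage_overhead(self) -> float:
        # Total blocks stored per block of original data.
        return (self.data_blocks + self.redundant_blocks) / self.data_blocks

    @property
    def failures_tolerated(self) -> int:
        # Both replication and MDS codes survive the loss of any
        # `redundant_blocks` blocks in a stripe.
        return self.redundant_blocks


def replication(copies: int) -> Scheme:
    # One data block plus (copies - 1) replicas; repair copies one replica.
    return Scheme(f"{copies}x replication", 1, copies - 1, 1)


def mds_code(k: int, r: int) -> Scheme:
    # Conventional MDS code: repair of one block reads k surviving blocks.
    return Scheme(f"MDS({k},{r})", k, r, k)


if __name__ == "__main__":
    for s in (replication(3), mds_code(10, 4)):
        print(f"{s.name}: overhead={s.storage_overhead:.2f}x, "
              f"tolerates={s.failures_tolerated} failures, "
              f"repair reads={s.repair_reads} block(s)")
```

Under these assumed parameters, the coded scheme stores far fewer bytes per byte of data while tolerating more failures, but rebuilding a single block reads ten times as much data as replication does. This is the kind of I/O and network cost that area (1) aims to reduce.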

An overarching goal of this thesis is to bridge theory and practice. Towards this goal, we present new code constructions and techniques that possess attractive theoretical guarantees. We also design and build systems that employ the proposed codes and techniques. These systems exhibit significant benefits over the state-of-the-art in evaluations that we perform in real-world settings, and are also slated to be a part of the next release of Apache Hadoop.

Details

Title

Erasure Coding for Big-data Systems: Theory and Practice

Creator

Vinayak, Rashmi, Author

Published

2016-09-14

Full Collection Name

Electrical Engineering & Computer Sciences Technical Reports

Other Identifiers

EECS-2016-155

Type

Text

Format

technical reports

Extent

165 p

Archive

The Engineering Library

Usage Statement

Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).

Collection

EECS Technical Reports
