Modern data centers have been providing exponentially increasing computing and storage resources, which have been fueling core applications ranging from search engines in the early 2000’s to the real-time, large-scale data analysis of today. All these breakthroughs were made possible only due to the scalability in computing and storage resources offered by modern large-scale clusters, comprising individually small and unreliable low-end devices. Given the individually unpredictable nature of the underlying devices in these systems, we face the constant challenge of securing predictable and high-quality performance of such systems in the face of uncertainty.

In this thesis, distributed storage and computing systems are viewed through a coding-theoretic lens. The role of codes in providing resiliency against noise has been studied for decades in many other engineering contexts, especially in communication systems, and codes are parts of our everyday infrastructure such as smartphones, WiFi, cellular systems, etc. Since the performance of distributed systems is significantly affected by anomalous system behavior and bottlenecks, which we call “system noise”, there is an exciting opportunity for codes to endow distributed systems with robustness against such system noise.

Our key observation – channel noise in communication systems is equivalent to system noise in distributed systems – forms the key motivation of this thesis, and raises the fundamental question: “can we use codes to guarantee robust speedups in distributed storage and computing systems?”. In this thesis, three main layers of distributed computing and storage systems – storage layer, computation layer, and communication layer – are robustified through coding-theoretic tools. For the storage layer, we show that coded distributed storage systems allow faster data retrieval in addition to the other known advantages such as higher data durability and lower storage overhead; for the computation layer, we inject computing redundancy into distributed algorithms that are robust to stragglers or nodes that are substantially slower than the other nodes; for the communication layer, we propose a novel data caching and communication protocol, based on coding-theoretic principles that can significantly reduce the network overhead of the data shuffling operation, which is necessary to achieve higher statistical efficiency when running parallel/distributed machine learning algorithms.




Download Full History