In this thesis, distributed storage and computing systems are viewed through a coding-theoretic lens. The role of codes in providing resiliency against noise has been studied for decades in many other engineering contexts, especially in communication systems, and codes are parts of our everyday infrastructure such as smartphones, WiFi, cellular systems, etc. Since the performance of distributed systems is significantly affected by anomalous system behavior and bottlenecks, which we call “system noise”, there is an exciting opportunity for codes to endow distributed systems with robustness against such system noise.
Our key observation – channel noise in communication systems is equivalent to system noise in distributed systems – forms the key motivation of this thesis, and raises the fundamental question: “can we use codes to guarantee robust speedups in distributed storage and computing systems?”. In this thesis, three main layers of distributed computing and storage systems – storage layer, computation layer, and communication layer – are robustified through coding-theoretic tools. For the storage layer, we show that coded distributed storage systems allow faster data retrieval in addition to the other known advantages such as higher data durability and lower storage overhead; for the computation layer, we inject computing redundancy into distributed algorithms that are robust to stragglers or nodes that are substantially slower than the other nodes; for the communication layer, we propose a novel data caching and communication protocol, based on coding-theoretic principles that can significantly reduce the network overhead of the data shuffling operation, which is necessary to achieve higher statistical efficiency when running parallel/distributed machine learning algorithms.