This dissertation characterizes two causes of variability in a large storage system: soft error behavior and disk drive heterogeneity. The first half of the dissertation focuses on understanding the error behavior and component failure characteristics of a storage prototype. The prototype is a loosely coupled collection of Pentium machines; each machine acts as a storage node, hosting disk drives via the SCSI interface. Examination of long-term system log data from this prototype yields several insights. In particular, the study reveals that data disk drives are among the most reliable components in the storage system and that soft errors tend to fall into a small number of well-defined categories. An in-depth study of hard failures provides data supporting the notion that failing devices exhibit warning signs, and it investigates the effectiveness of failure prediction.

The second half of the dissertation, dealing with disk drive heterogeneity, focuses on a new measurement technique to characterize disk drives. The technique, linearly increasing strides, counteracts the rotational effect that makes disk drives difficult to measure. The linearly increasing stride pattern interacts with the drive mechanism to create a latency versus stride size graph that exposes many low-level disk details. This micro-benchmark extracts a drive's minimum time to access media, rotation time, sectors per track, head switch time, cylinder switch time, and number of platters, as well as several other pieces of information. The dissertation describes the read and write versions of this micro-benchmark, named Skippy, as well as analytical models explaining its behavior, results on modern SCSI and IDE disk drives, techniques for automatically extracting parameter values from the graphical output, and extensions.
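As a rough illustration of why a linearly increasing stride can expose rotation time and the minimum time to access media, the following toy model simulates a single track only. It is a sketch under stated assumptions, not the dissertation's analytical model or the Skippy implementation: the function names, the default disk parameters, and the simplification that ignores head switches, cylinder switches, seeks, and caching are all hypothetical.

```python
import math


def simulated_latency_ms(stride, rotation_ms=10.0, sectors_per_track=100,
                         min_media_ms=2.0):
    """Toy latency of an access landing `stride` sectors past the previous one.

    Assumption: the drive cannot touch the media sooner than `min_media_ms`
    after the previous access. If the target sector passes under the head
    before then, the drive waits whole extra rotations.
    """
    rotate_ms = stride * rotation_ms / sectors_per_track
    if rotate_ms >= min_media_ms:
        extra_rotations = 0
    else:
        extra_rotations = math.ceil((min_media_ms - rotate_ms) / rotation_ms)
    return rotate_ms + extra_rotations * rotation_ms


def skippy_curve(max_stride, **disk):
    """Latency for strides 0, 1, 2, ... (the linearly increasing pattern)."""
    return [(s, simulated_latency_ms(s, **disk)) for s in range(max_stride)]


def estimate_from_curve(curve):
    """Hypothetical extraction: locate the largest latency drop in the curve."""
    drops = [(curve[i - 1][1] - curve[i][1], i) for i in range(1, len(curve))]
    _, i = max(drops)
    return {
        "rotation_ms": curve[0][1],   # stride 0 costs one full rotation here
        "min_media_ms": curve[i][1],  # first stride that avoids the extra wait
    }


if __name__ == "__main__":
    print(estimate_from_curve(skippy_curve(100)))
```

In this model, small strides pay for a full extra rotation, and the latency falls sharply at the first stride whose rotational distance covers the minimum media access time. Reading that drop off the curve is the kind of inference the latency versus stride size graph supports; a real drive adds further features that this sketch omits.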



