We analyze how modern distributed storage systems behave in the presence of storage faults such as data corruption and read and write errors. We characterize eight popular distributed storage systems and uncover numerous bugs related to storage fault tolerance. We find that modern distributed systems do not consistently use redundancy to recover from storage faults: a single storage fault can cause catastrophic outcomes such as data loss, corruption, and unavailability. Our results have implications for the design of next-generation fault-tolerant distributed and cloud storage systems.
Program committee comment
We have all heard that redundancy is necessary for fault tolerance in distributed systems. It turns out that redundancy alone is not sufficient in many practical use cases. In her talk, Aishwarya will go deep into the possible failure modes and how to avoid them.