Efficient, scalable consistency for
highly fault-tolerant storage

Garth Goodson

August 2004
CMU-CS-04-111

Dept. of Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
Submitted in partial fulﬁllment of the requirements
for the degree of Doctor of Philosophy.
Prof. Gregory R. Ganger, Chair
Prof. Michael K. Reiter
Prof. Priya Narasimhan
Richard Golding, IBM Research
© 2004 Garth Goodson

Keywords: survivable storage systems, Byzantine fault-tolerance, erasure codes

To my parents, for their unwavering support.
Abstract

Fault-tolerant storage systems spread data redundantly across a set of storage-nodes in an effort to preserve and provide access to data despite failures. One difficulty created by this architecture is the need for a consistent view, across storage-nodes, of the most recent update. Such consistency is made difficult by concurrent updates, partial updates made by clients that fail, and failures of storage-nodes.
This thesis demonstrates a novel approach to achieving scalable, highly fault-tolerant storage systems by leveraging a set of efﬁcient and scalable, strong consistency protocols enabled by storage-node versioning. Versions maintained by storage-nodes can be used to provide consistency, without the need for central serialization, and despite concurrency.
Since versions are maintained for every update, even if a client fails partway through an update, or concurrency exists during an update, the latest complete version of the data-item being accessed still exists in the system—it does not get destroyed by subsequent updates.
Additionally, versioning enables the use of optimistic protocols.
This thesis develops a set of consistency protocols appropriate for constructing block-based storage and metadata services. The block-based storage protocol is made space-efficient through the use of erasure codes and made scalable by offloading work from the storage-nodes to the clients. The metadata service is made scalable by avoiding the high costs associated with agreement algorithms and by utilizing threshold voting quorums.
Fault-tolerance is achieved by developing each protocol in a hybrid storage-node fault-model (a mix of Byzantine and crash storage-nodes can be tolerated), capable of tolerating crash or Byzantine clients, and utilizing asynchronous communication.
Acknowledgements

I would first like to thank my advisor, Greg Ganger, without whose support and insight this work would not be possible. Next, I would like to thank Mike Reiter, whose patience and wealth of distributed systems knowledge guided us through this project. I would also like to thank the rest of my thesis committee members, Richard Golding and Priya Narasimhan, for their valuable comments, feedback, and time.
From my group at the PDL, my warmest thanks go to Jay Wylie, my research partner and friend. Without him, I would still be filling white boards with meaningless scribble. Next, I would like to thank Mike Abd-El-Malek, our newest PASIS member. He came up to speed on the complicated code-base in an amazingly short amount of time and provided invaluable help in implementing the query/update protocol. Finally, from my group, I would like to thank those who made my time at CMU a very enjoyable experience, especially those from previous projects: John Strunk and Craig Soules; as well
as Andy Klosterman, Jiri Schindler, Shuheng Zhou; and everyone else from my group:
Steve Schlosser, John Grifﬁn, Adam Pennington, Eno Thereska, Brandon Salmon. A special thanks goes to the PDL staff, especially to Karen Lindenfelser and Linda Whipkey for their fantastic organization and caring, positive attitude.
I would like to thank my family, especially my parents for providing their unending support. Last, but not least, I would like to thank my wonderful wife Vianey. I thank her for her patience and the sacriﬁces she made as she waited for me to ﬁnish my thesis. But, most of all, I would like to thank her for her support, friendship, and undying love.
I would like to thank the members and companies of the PDL Consortium (including EMC, Engenio, Hewlett-Packard, HGST, Hitachi, IBM, Intel, Microsoft, Network Appliance, Oracle, Panasas, Seagate, Sun, and Veritas) for their interest, insights, feedback, and support. This material is based on research sponsored in part by the National Science Foundation, via grant #CNS-0326453, by the Air Force Research Laboratory, under agreement numbers F49620-01-1-0433 and F30602-99-2-0539-AFRL, and by the Army Research Office, under agreement number DAAD19-02-1-0389.
Finally, I would like to thank Miguel Castro and Rodrigo Rodrigues for making the implementation of BFT publicly available.
1.1 Problem definition

Fault-tolerant storage systems (e.g., Petal [Lee and Thekkath 1996], Myriad [Chang et al.
2002], SwiftRAID [Long et al. 1994], and Cheops [Amiri et al. 2000a]) spread data redundantly across a set of storage-nodes in an effort to preserve and provide access to data despite failures. Figure 1.1 illustrates the
architecture of a fault-tolerant, or survivable, distributed storage system. In these types of systems it is common to break the system (at least logically) into two components or services: a metadata service and a data service. To access or update a data-item, a client must obtain metadata about the data-item (e.g., where and how it is stored) before it is able to access the data-item itself. In order for a storage-system to tolerate failures, both the data and metadata must be duplicated across a set of storage-nodes. Thus, an update is only complete once it has completed successfully at a subset of the storage-nodes. While this scheme provides access to data-items and their metadata even when subsets of the storage-nodes have failed, it does create the difficulty of maintaining a consistent view, across the storage-nodes, of the most recent update. In decentralized systems, this problem is exacerbated since, due to the lack of a central serialization point, updates may not be issued to the same subset of storage-nodes as are being read. Without consistency across the set of storage-nodes, data loss is possible or even likely.
Although protocols exist for achieving such consistency, they generally fall short in a number of areas including fault-tolerance, efficiency, and scalability. The easiest solution to this problem is to introduce a point of serialization. This is typically done by serializing
Figure 1.1: High-level architecture for survivable storage.
Spreading data and metadata redundantly across storage-nodes improves fault-tolerance. Clients write and (usually) read data from multiple storage-nodes and may contact multiple storage-nodes to perform metadata operations.
requests through a primary. However, this typically reduces the scalability of the system and requires additional protocols to tolerate the failure of the primary. Other, more complicated protocols exist; however, they generally require a significant amount of overhead in the common case of little or no concurrency. Most studies of distributed storage systems (e.g., [Baker et al. 1991; Noble and Satyanarayanan 1994]) indicate that concurrency is uncommon (i.e., writer-writer and writer-reader sharing occurs in well under 1% of operations). As well, many protocols do not scale in terms of messaging or protocol overhead, or storage-node CPU utilization, as the number of faults tolerated increases.
This thesis demonstrates a novel approach to achieving scalable, highly fault-tolerant (the ability to tolerate more than a single fault) storage systems by leveraging a set of efﬁcient and scalable, strong consistency protocols enabled by storage-node versioning.
Versioning storage-nodes keep a version of every update they receive (for some period of time). These versions can be used to provide consistency, without the need for central serialization, and despite concurrency. Since versions are maintained for every update, even if a client fails part way through an update, or a reader performs a query during an update, the latest complete version of the data-item being accessed still exists in the system—it does not get destroyed by subsequent updates. The problem with versioning becomes one of the client locating the version it is interested in. The consistency protocols developed in this thesis all use logical time as a means of naming versions. Additionally, versioning enables the use of optimistic protocols. Since older versions are not overwritten by new updates, there is no need to lock the data-item before performing an update. However, in these protocols, concurrency may require the client to perform extra work to find the set of versions in which they are interested (e.g., those that comprise the latest complete update).
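As a concrete illustration of the versioning-plus-logical-time idea, consider the following minimal sketch of a versioning storage-node. It is not the thesis's implementation; the class and method names are invented, and real storage-nodes would hold erasure-coded fragments rather than whole values.

```python
class VersioningNode:
    """A toy storage-node that retains every version of a data-item,
    keyed by the logical timestamp supplied by the writer."""

    def __init__(self):
        self.versions = {}  # logical timestamp -> value

    def write(self, timestamp, value):
        # Never overwrite in place: each update creates a new version.
        self.versions[timestamp] = value

    def latest(self):
        # Return the highest-timestamped version, or None if empty.
        if not self.versions:
            return None
        ts = max(self.versions)
        return ts, self.versions[ts]

    def read_version(self, timestamp):
        # A client that judges an update incomplete can reach back
        # to an earlier version by its logical timestamp.
        return self.versions.get(timestamp)
```

Because `write` only ever adds entries, a failed or concurrent update can never destroy the latest complete version; the client's job reduces to identifying which timestamp names that version.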
In particular, this thesis develops a set of consistency protocols appropriate to building block-based storage and metadata services. The block-based storage protocol is made space-efﬁcient through the use of erasure codes and made scalable by ofﬂoading work from the storage-nodes to the clients. The metadata storage protocol is made scalable by avoiding the high costs associated with agreement algorithms and by utilizing threshold voting quorums. Fault-tolerance is achieved by developing each protocol in a hybrid storage-node fault-model (a mix of Byzantine and crashed storage-nodes can be tolerated), capable of tolerating crash or Byzantine clients, and utilizing asynchronous communication.
1.2 Thesis statement

Versioning storage-nodes enable the design of a set of scalable, efficient consistency protocols that provide a foundation for constructing scalable, highly fault-tolerant, distributed storage systems.
1.2.1 Validation

This work is validated through the design and evaluation of three consistency protocols
that have been enabled by versioning storage-nodes. More precisely:
(1) It develops and demonstrates a read/write block storage consistency protocol that enables highly fault-tolerant storage through the use of erasure coded data and versioning storage-nodes. Its correctness is shown through proof sketches.
(2) It develops and demonstrates a read/conditional-write block protocol that allows for stronger read–modify–write consistency semantics. Additionally, tradeoffs between tolerating Byzantine clients and erasure coding, as well as tradeoffs between tolerating Byzantine storage-nodes and liveness, are discussed.
(3) It extends the read/conditional-write block protocol to support operations on multiple, arbitrary objects and implements a scalable metadata service based upon this and the read/write protocol.
(4) It evaluates a distributed ﬁle system that utilizes the scalability and fault-tolerance of the developed consistency protocols in terms of the number of faults tolerated, the maximum throughput the system can sustain, and its performance in degraded operation modes (i.e., with concurrency and faults).
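The space-efficiency claim in items (1) and (2) rests on erasure coding: an m-of-n code stores n fragments, any m of which suffice to reconstruct the data, for a storage blowup of n/m rather than the factor-of-n cost of full replication. The following is a minimal single-parity illustration (an m+1-fragment code tolerating one erasure); it is not the code used in the thesis, which would employ a more general scheme, and all function names are invented.

```python
def encode(data: bytes, m: int):
    """Split data into m equal fragments plus one XOR parity fragment.
    Data is zero-padded to a multiple of m; a real system would also
    record the original length so padding can be stripped on decode."""
    frag_len = -(-len(data) // m)  # ceiling division
    padded = data.ljust(frag_len * m, b"\x00")
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(m)]
    parity = bytes(frag_len)
    for f in frags:
        parity = bytes(a ^ b for a, b in zip(parity, f))
    return frags + [parity]  # n = m + 1 fragments

def decode(frags, missing, m):
    """Recover the m data fragments when fragment `missing`
    (index m denotes the parity) has been lost."""
    if missing == m:          # only the parity was lost: data intact
        return frags[:m]
    # XOR the surviving data fragments into the parity to rebuild.
    rebuilt = frags[m]
    for i, f in enumerate(frags[:m]):
        if i != missing:
            rebuilt = bytes(a ^ b for a, b in zip(rebuilt, f))
    out = list(frags[:m])
    out[missing] = rebuilt
    return out
```

Here 12 bytes of data occupy 16 bytes across four storage-nodes while surviving one erasure, versus 36 bytes for triple replication with comparable fault-tolerance to one crash.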
1.3 Overview

1.3.1 Consistency protocols

First, the Read/Write protocol (R/W) is developed. It provides strong consistency and fault-tolerance for read/write block storage. Block storage-systems (e.g., SCSI, Fibre Channel) provide the backbone for most current storage solutions.
The R/W protocol works roughly as follows. To perform a write, clients write timestamped fragments to at least a write threshold of storage-nodes. Storage-nodes keep all versions of fragments they are sent. To perform a read, clients fetch the latest fragment versions from a read threshold of storage-nodes. The client determines whether the fragments comprise a consistent, complete write, based on timestamp ordering; usually, they do. If they do not, additional fragments or historical fragments are fetched, or repair is performed, until a consistent, complete write is observed. Only in cases of failures (storage-node or client) or read-write concurrency is additional overhead incurred to maintain consistency.
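The read-side classification step can be pictured with the following deliberately simplified sketch. It assumes crash-only nodes and whole-value replication rather than the erasure-coded, hybrid-fault-model protocol actually developed in this thesis, and the function and threshold names are invented.

```python
def classify_read(responses, write_threshold):
    """Given responses {node_id: (timestamp, fragment)} from a read
    threshold of storage-nodes, decide whether the highest observed
    timestamp corresponds to a complete write.

    Returns ('complete', ts) if at least write_threshold nodes hold
    the highest-timestamped fragment; otherwise ('incomplete', ts),
    signalling the client to fetch additional or historical fragments,
    or to perform repair, before returning data."""
    assert responses, "must hear from a read threshold of nodes"
    latest_ts = max(ts for ts, _ in responses.values())
    count = sum(1 for ts, _ in responses.values() if ts == latest_ts)
    if count >= write_threshold:
        return ("complete", latest_ts)
    return ("incomplete", latest_ts)
```

In the common case of no failures and no concurrency, the first classification succeeds and the read completes in a single round; only the uncommon cases trigger the extra fetch or repair work.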
The second protocol, the Read/Conditional Write protocol (R/CW), extends the R/W protocol by only allowing write operations to complete if the object has not changed since the last time it was read. Thus, the R/CW protocol is able to provide stronger consistency semantics—similar to read–modify–write, rather than just read–write semantics.
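The conditional-write semantics can be thought of as a compare-and-swap on the version timestamp. The sketch below is a hypothetical single-node simplification (names invented); in the actual protocol the condition is evaluated across a quorum of versioning storage-nodes.

```python
class ConditionalStore:
    """Accept a write only if the object is unchanged since the
    timestamp the client observed when it read."""

    def __init__(self):
        self.versions = [(0, None)]  # (timestamp, value) history

    def read(self):
        return self.versions[-1]     # latest (timestamp, value)

    def conditional_write(self, observed_ts, new_ts, value):
        latest_ts, _ = self.versions[-1]
        if latest_ts != observed_ts:
            return False             # object changed since the read: reject
        self.versions.append((new_ts, value))
        return True
```

A rejected conditional write simply tells the optimistic client to re-read and retry, which is cheap precisely because concurrency is rare.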
Finally, a Query/Update protocol (Q/U) is developed. It is very similar to the R/CW protocol but adds a few optimizations and extensions. Most notably, it provides strict serializability of arbitrary operations through the use of replicated state machines.
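The flavor of the Q/U extension, arbitrary deterministic operations conditioned on the object versions the client observed, possibly spanning multiple objects, can be sketched as follows. This toy is single-node and non-fault-tolerant; all names are invented, and the real protocol runs over a quorum of versioning storage-nodes.

```python
def make_object(initial):
    """A toy object: a version history of (timestamp, state) pairs."""
    return [(0, initial)]

def multi_object_update(objects, observed, operation, new_ts):
    """Commit a deterministic `operation` across several objects only
    if none has changed since the client observed it; otherwise abort
    so the client can re-query and retry."""
    # Validate every precondition before mutating anything.
    for name, hist in objects.items():
        if hist[-1][0] != observed[name]:
            return False
    # All preconditions hold: apply the operation to the latest states.
    states = {name: hist[-1][1] for name, hist in objects.items()}
    for name, new_state in operation(states).items():
        objects[name].append((new_ts, new_state))
    return True
```

Because every replica applies the same deterministic operation at the same logical time, the committed histories serialize identically, which is the essence of the replicated state machine approach.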
These protocols are developed in detail, evaluated individually, and used as a basis for building a fault-tolerant, scalable storage-system.
1.3.2 Guiding assumptions
These consistency protocols achieve efﬁciency and scalability via a combination of optimistic operation, versioning, and quorum-style redundancy. As such, there are a number of assumptions that guide the use of versioning and optimism. As well, there are a number of assumptions in the system model for which these protocols are designed.
Optimism and versioning
The scalability features of quorums are well-known [Malkhi et al. 2000; Naor and Wool 1998]; however, the use of versioning and optimism is guided by three high-level assumptions.
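The quorum scalability cited above follows from intersection properties. For instance, in a masking quorum system in the style of Malkhi and Reiter, n ≥ 4b + 1 servers with quorums of size ⌈(n + 2b + 1)/2⌉ guarantee that any two quorums intersect in at least 2b + 1 servers, so correct servers outnumber the b Byzantine ones in every intersection. A small checker for these bounds (this illustrates the classical construction, not the exact thresholds of this thesis's hybrid fault model):

```python
import math

def masking_quorum_size(n, b):
    """Smallest quorum size q such that any two quorums of q servers
    out of n intersect in at least 2b + 1 servers (by pigeonhole,
    |Q1 ∩ Q2| >= 2q - n).  Requires n >= 4b + 1."""
    if n < 4 * b + 1:
        raise ValueError("need n >= 4b + 1 to mask b Byzantine servers")
    q = math.ceil((n + 2 * b + 1) / 2)
    assert 2 * q - n >= 2 * b + 1  # worst-case intersection suffices
    return q
```

Because quorums are a constant fraction of n, each request touches only a subset of servers, and load can be spread across overlapping quorums as the system grows.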
First, we assume that client failures within the duration of an access protocol (on the order of milliseconds) should be rare. That is, while we design to tolerate client failures (indeed, arbitrary ones; see below), our protocols optimistically presume they will not occur, and exploit this assumption heavily in order to improve throughput when it holds.