Efficient, scalable consistency for highly fault-tolerant storage
Garth Goodson
August 2004
CMU-CS-04-111
Dept. of Electrical and Computer …
This section describes work related to building scalable metadata services. Numerous previous systems have focused on horizontally scaling data services through the addition of storage-nodes to obtain high data throughput. However, most systems either utilize a centralized metadata service or partition metadata across a set of servers such that each piece of metadata is handled by a single metadata server. The former approach is limited in its ability to scale, and both approaches render a metadata operation susceptible to a fault or compromise of the server responsible for it.
For example, NASD [Gibson et al. 1998] and Swift [Cabrera and Long 1991] centralize access to a metadata server. IBM's Storage Tank [Menon et al. 2003] and Lustre [Braam 2004] replace the central metadata server with a cluster of servers, partitioning metadata across the servers while supporting server fail-over. Likewise, some systems partition certain metadata structures (e.g., the manager map in xFS [Anderson et al. 1996] and the lock table in Frangipani [Thekkath et al. 1997]). Other systems make use of distributed protocols that communicate among the metadata servers to provide a replicated, fault-tolerant metadata service (e.g., Paxos [Lamport 1998] in Frangipani, and BFT [Castro and Liskov 2002] in Farsite [Adya et al. 2002] and OceanStore [Kubiatowicz et al. 2000]). Lastly, in some systems the storage-devices export interfaces directly to the client that provide serialized access to the device (e.g., device-served locks in GFS [Soltis et al. 1996] and base storage transactions by Amiri et al. [Amiri et al. 2000b]).
Survivable ﬁle systems have typically focused on the use of Byzantine fault-tolerant replication to protect the metadata service (e.g., [Deswarte et al. 1991]). Modern examples such as Farsite [Adya et al. 2002], OceanStore [Kubiatowicz et al. 2000], and BFS [Castro and Liskov 2002] employ state machine replication [Schneider 1990] for this purpose.
While a powerful paradigm, state machine replication suffers from fundamental scaling limitations. First, all service nodes process all requests, so update throughput generally does not improve with additional nodes. Second, since the message complexity of the underlying agreement protocols is Θ(n²) with n replicas, the effect of adding nodes can be to degrade metadata update throughput. As such, adding replicas to the service group is of limited value: the throughput of read-only operations may improve, but the throughput of update operations at best would remain constant.
Consequently, to allow the metadata service in, e.g., Farsite to scale, the ﬁle system name-space is partitioned across multiple metadata services [Adya et al. 2002]. However, partitioning the name-space introduces another difﬁculty, namely implementing metadata operations atomically across replica groups, particularly in a manner resilient to Byzantine servers and clients. We are aware of no metadata service implementation that achieves this.
Our protocols employ a different paradigm that permits better load-balancing of requests across servers and linear-or-better message complexity per client request, and thus better ability to scale throughput as new servers are added. Rather than partitioning the name-space, we implement all metadata operations with a single replica group, and scale via lighter-weight access protocols than those implementing state machine replication. In the spirit of quorum protocols [Malkhi et al. 2000; Naor and Wool 1998], our approach permits clients to involve only a subset of servers in each operation (with no server-to-server communication). In particular, each read or update operation need only execute on a subset of metadata-nodes. Since all metadata operations are served in the same replica group, our approach can implement any metadata operation atomically. Thus, our metadata objects are, in effect, replicated state machines.
Extending our conditional write protocol to send update operations and to receive operation results, rather than sending and receiving whole objects, is efﬁcient for objects with large state (e.g., directory objects). The optimistic nature of the conditional write protocol distinguishes it from other Byzantine quorum protocols. However, the protocol does not achieve the lower bound on N for implementing a Byzantine-tolerant replicated state-machine (i.e., N ≥ 4b + 1 [Malkhi and Reiter 1998a]).
The protocols developed in this thesis are most closely related to threshold-quorum systems (i.e., majority voting systems [Gifford 1979; Thomas 1979]), though our approach offers opportunities for exploring use of richer quorum constructions (e.g., [Malkhi and Reiter 1998a; Malkhi et al. 2000]). In a threshold-quorum system, the load [Naor and Wool 1998; Malkhi et al. 2000] on each storage-node is at least one half; that is, each storage-node must execute requests for at least one half of the operations applied to objects it hosts. True Byzantine quorum systems [Malkhi and Reiter 1998a] scale better than the one-half bound. If Byzantine quorum construction techniques such as the M-Path construction [Malkhi et al. 2000] are employed, then the lower bound on load is Ω(√(b/N)).
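To make the load comparison concrete, the following sketch computes the per-node load of a simple masking threshold quorum (pairwise quorum intersections of at least 2b + 1 nodes, so q ≥ (N + 2b + 1)/2) against the asymptotic √(b/N) scaling of richer constructions such as M-Path. The quorum-size formula and the omitted constants are illustrative assumptions, not the thesis's exact parameters.

```python
# Load of a threshold (majority-style) quorum system: each operation
# touches q of N storage-nodes, so under uniform access each node
# serves a q/N fraction of operations.

import math

def threshold_quorum_load(N, b):
    # Assumed masking-quorum condition: any two quorums intersect in
    # at least 2b + 1 nodes, giving q >= (N + 2b + 1) / 2.
    q = math.ceil((N + 2 * b + 1) / 2)
    return q / N

def mpath_style_load(N, b):
    # Asymptotic load of richer constructions such as M-Path:
    # Theta(sqrt(b / N)), constants omitted for illustration.
    return math.sqrt(b / N)

for N in (10, 100, 1000):
    print(N, round(threshold_quorum_load(N, 1), 3),
          round(mpath_style_load(N, 1), 3))
```

The threshold load stays above one half no matter how many storage-nodes are added, while the M-Path-style bound shrinks as N grows, which is the scaling difference the text describes.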
3 Read/Write Block Protocol

This chapter describes and evaluates a new consistency protocol that operates in an asynchronous environment and tolerates Byzantine failures of clients and storage-nodes. The protocol supports a hybrid failure model in which up to t storage-nodes may fail; b ≤ t of these failures can be Byzantine and the remainder can be crash. The protocol also supports use of m-of-n erasure codes (i.e., m of n fragments are needed to reconstruct the data), which usually require less network bandwidth (and storage space) than full replication [Weatherspoon and Kubiatowicz 2002; Wylie et al. 2000].
Briefly, the protocol works as follows. To perform a write, a client determines the current logical time (by querying a subset of the storage-nodes) and then writes timestamped fragments to at least a threshold quorum of storage-nodes. Storage-nodes keep all versions of fragments they are sent until garbage collection frees them. To perform a read, a client fetches the latest fragment versions from a threshold quorum of storage-nodes and determines whether they comprise a completed write; usually, they do. If they do not, additional and historical fragments are fetched, and repair may be performed, until a completed write is observed.
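The write/read flow above can be sketched in a few lines. This is a deliberately simplified model — crash-only storage-nodes, full replication instead of erasure coding, a fixed quorum, and no repair path — so all names and sizes are illustrative assumptions, not the thesis implementation.

```python
# Simplified sketch of the timestamped read/write flow: a writer picks
# a logical time above anything a quorum has seen; a reader takes the
# highest-timestamped version returned by a quorum. Crash-only model,
# replication instead of erasure coding, no completeness check/repair.

N, QUORUM = 5, 3          # illustrative sizes, not the thesis bounds

class StorageNode:
    def __init__(self):
        self.versions = {}            # timestamp -> value (all kept)
    def latest(self):
        if not self.versions:
            return (0, None)          # well-known zero time / null value
        ts = max(self.versions)
        return (ts, self.versions[ts])
    def store(self, ts, value):
        self.versions[ts] = value     # fine-grained versioning

nodes = [StorageNode() for _ in range(N)]

def write(value):
    # 1. determine the current logical time from a quorum of nodes
    ts = max(n.latest()[0] for n in nodes[:QUORUM]) + 1
    # 2. send the timestamped value to (at least) a threshold quorum
    for n in nodes[:QUORUM]:
        n.store(ts, value)
    return ts

def read():
    # fetch the latest versions from a quorum and return the value
    # with the highest timestamp (a real reader would verify that it
    # constitutes a completed write, and repair it if not)
    replies = [n.latest() for n in nodes[:QUORUM]]
    ts, value = max(replies)
    return value

write("A")
write("B")
print(read())
```

Because every quorum of 3 of the 5 nodes overlaps any other quorum, a read that follows a completed write necessarily observes its timestamp; the versions dictionary mirrors the storage-nodes' keep-all-versions behavior described above.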
The protocol gains efficiency from five features. First, the space-efficiency of m-of-n erasure codes can be substantial, reducing communication overheads significantly. Second, most read operations complete in a single round trip: only reads that observe write concurrency or failures (of storage-nodes or a client write) may incur additional work. Most studies of distributed storage systems (e.g., [Baker et al. 1991; Noble and Satyanarayanan 1994]) indicate that concurrency is uncommon (i.e., writer-writer and writer-reader sharing occurs in well under 1% of operations). Failures, although tolerated, ought to be rare.
Third, incomplete writes are replaced by subsequent writes or reads (that perform repair), thus preventing future reads from incurring any additional cost; when subsequent writes do the ﬁxing, additional overheads are never incurred. Fourth, most protocol processing is performed by clients, increasing scalability via the well-known principle of shifting work from servers to clients [Howard et al. 1988]. Fifth, the protocol only requires the use of cryptographic hashes, rather than more expensive cryptographic primitives (e.g., digital signatures).
This chapter describes the protocol in detail, develops bounds for the thresholds in terms of the number of failures tolerated, and provides a proof sketch of the protocol's safety and liveness. The protocol requires at least 2t + 2b + 1 storage-nodes (i.e., 4b + 1 if t = b). The chapter also describes and evaluates the protocol's use in a prototype storage system called PASIS [Wylie et al. 2000]. To demonstrate that our protocol is efficient in practice, we compare its performance to BFT [Castro and Liskov 2001; 2002], the Byzantine fault-tolerant replicated state machine implementation that Castro and Liskov have made available [Castro and Rodrigues 2003]. Experiments show that PASIS scales better than BFT in terms of network utilization at the server and in terms of work performed by the server. Experiments also show that response times of PASIS and BFT are comparable.
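The node-count bound above is easy to tabulate. The sketch below compares the protocol's 2t + 2b + 1 requirement with the well-known 3b + 1 minimum for BFT-style replicated state machines; the function names are ours, chosen for illustration.

```python
# Minimum storage-node counts under the hybrid failure model:
# this protocol needs N >= 2t + 2b + 1 (so 4b + 1 when t = b),
# versus the 3b + 1 minimum for BFT-style state machine replication.

def min_nodes_pasis(t, b):
    # t total faulty nodes tolerated, of which b may be Byzantine
    assert b <= t, "Byzantine faults are a subset of all faults"
    return 2 * t + 2 * b + 1

def min_nodes_bft(b):
    return 3 * b + 1

for b in range(1, 4):
    print(b, min_nodes_pasis(b, b), min_nodes_bft(b))
```

The extra replicas relative to BFT are the price of the lighter-weight quorum access pattern: with t = b = 1 the protocol needs 5 storage-nodes where BFT needs 4, but it avoids server-to-server agreement traffic.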
3.1 System model
tine clients). We assume that Byzantine clients and storage-nodes are computationally bounded so that we can beneﬁt from cryptographic primitives (i.e., cryptographic hash functions).
The protocol is developed with a hybrid storage-node failure model [Thambidurai and Park 1988]. Under a traditional hybrid failure model, up to t storage-nodes could fail, b ≤ t of which may be Byzantine; the remainder could only crash. However, we consider a hybrid failure model for storage-nodes that crash and recover. The crash-recovery failure model is a strict generalization of the omission and crash failure models.
First, we review the crash-recovery model from Aguilera et al. [Aguilera et al. 2000].
In a system of n processes, each process can be classiﬁed as always-up, eventually-up, eventually-down, or unstable. A process that is always-up never crashes. A process that is eventually-up crashes at least once, but there is a time after which it is permanently up.
A process that is eventually-down crashes at least once, and there is a time after which it is permanently down. A process that is unstable crashes and recovers inﬁnitely many times. These classiﬁcations are further reﬁned: a process is good if it is either always-up or eventually-up.
We combine the crash-recovery model with the hybrid failure model as follows. Up to b storage-nodes may ever be Byzantine; such storage-nodes do not recover and are not good. There are at least N − t good storage-nodes (where b ≤ t). A storage-node that is not Byzantine is said to be benign (i.e., benign storage-nodes are either always-up, eventually-up, eventually-down, or unstable). We assume that storage-nodes have stable storage that persists throughout the crash and recovery process.
The protocol tolerates crash and Byzantine clients. As in any practical storage system, an authorized Byzantine client can write arbitrary values to storage. Such writes affect only the value of the data; they do not compromise the safety (linearizability) of the object. A client that does not exhibit a Byzantine failure (i.e., it is either correct or crashes) is benign.

We assume an asynchronous model of time (i.e., we make no assumptions about message transmission delays or the execution rates of clients and storage-nodes, except that they are non-zero). We assume point-to-point authenticated channels with properties similar to those used by Aguilera et al. [Aguilera et al. 2000]. In summary, channels do not create messages (no creation), channels may experience finite duplication, and channels are fair-loss. The finite duplication property ensures that if benign process p sends a message to benign process q only a finite number of times, then q receives the message only a finite number of times. The fair loss property ensures that if benign process p sends infinitely many messages to good process q, then q receives infinitely many messages from p.
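The fair loss property is what makes reliable delivery implementable at all: a sender that retransmits until acknowledged eventually gets through. The sketch below simulates this with a deterministic drop pattern standing in for a lossy network; the channel shape and names are purely illustrative assumptions.

```python
# Layering reliable delivery on a fair-loss channel: retransmit until
# the message is delivered. The drop pattern here is a deterministic
# stand-in for network loss (every 2nd send is dropped); a fair-loss
# channel never creates messages, it only loses or duplicates them.

import itertools

def make_fair_loss_channel(drop_every=2):
    counter = itertools.count(1)
    def send(msg, deliver):
        if next(counter) % drop_every != 0:   # this send survives
            deliver(msg)
            return True                        # modeled acknowledgment
        return False                           # this send was dropped
    return send

received = []
channel = make_fair_loss_channel()

def retransmit_until_delivered(msg, max_tries=10):
    # With fair loss, repeated sends eventually succeed; max_tries
    # bounds the simulation, not the abstract property.
    for _ in range(max_tries):
        if channel(msg, received.append):
            return True
    return False

print(retransmit_until_delivered("fragment-17"))
```

In the protocol itself this pattern appears as clients re-issuing requests to storage-nodes until enough responses arrive, which is why only good (eventually-up) nodes need to be counted toward quorums.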
There are two types of operations in the protocol — read operations and write operations — both of which operate on data-items. Clients perform read/write operations that issue multiple read/write requests to storage-nodes. A read/write request operates on a data-fragment. A data-item is encoded into data-fragments. Clients may encode data-items in an erasure-tolerant manner; thus the distinction between data-item and data-fragment. Requests are executed by storage-nodes; a correct storage-node that executes a write request hosts that write operation.
We focus here on threshold erasure codes in which any m of the n encoded data-fragments can decode the data-item; when m = 1, the code reduces to replication. Examples of such codes are replication, Reed-Solomon codes [Berlekamp 1968], secret sharing [Shamir 1979], RAID 3/4/5/6 [Patterson et al. 1988], information dispersal (IDA) [Rabin 1989], short secret sharing [Krawczyk 1994], and "tornado" or LDPC codes [Luby et al. 2001].
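A minimal concrete instance of an m-of-n threshold code is a 2-of-3 scheme with XOR parity, in the spirit of RAID 4. This toy sketch is only illustrative — the codes cited above, such as Reed-Solomon, generalize to arbitrary m and n — but it shows the defining property: any m fragments decode the data-item.

```python
# Toy 2-of-3 threshold erasure code: split a data-item into two halves
# plus an XOR-parity fragment; any 2 of the 3 fragments decode it.
# Each fragment is half the data-item's size, versus full copies under
# 3-way replication.

def encode(data: bytes):
    if len(data) % 2:
        data += b"\x00"               # pad to even length (toy framing)
    half = len(data) // 2
    d1, d2 = data[:half], data[half:]
    parity = bytes(a ^ b for a, b in zip(d1, d2))
    return [d1, d2, parity]          # fragments 0, 1, 2

def decode(fragments):
    # fragments: list of (index, bytes) pairs, any two of {0, 1, 2}
    frag = dict(fragments)
    if 0 in frag and 1 in frag:
        d1, d2 = frag[0], frag[1]
    elif 0 in frag:                   # recover d2 = d1 XOR parity
        d1 = frag[0]
        d2 = bytes(a ^ b for a, b in zip(d1, frag[2]))
    else:                             # recover d1 = d2 XOR parity
        d2 = frag[1]
        d1 = bytes(a ^ b for a, b in zip(d2, frag[2]))
    return d1 + d2

frags = encode(b"fault-tolerant")
print(decode([(0, frags[0]), (2, frags[2])]))
```

The bandwidth argument from Section 3's introduction falls out directly: a writer ships 3 half-size fragments (1.5x the data) instead of 3 full replicas (3x).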
Storage-nodes provide fine-grained versioning; correct storage-nodes host a version of the data-fragment for each write request they execute. There is a well-known zero time, 0, and a null value, ⊥, which storage-nodes can return in response to read requests. Implicitly, all stored data is initialized to ⊥ at time 0.
Figure 3.1: Example of cross checksum generation for 5 data-fragments. To generate a cross checksum, a cryptographic hash is taken of each data-fragment. These hashes are then concatenated, replicated, and stored with each data-fragment.
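The construction in Figure 3.1 can be expressed directly. In this sketch SHA-256 stands in for whichever cryptographic hash the system actually uses, and the fragment contents are made up for illustration.

```python
# Cross checksum: hash each data-fragment, concatenate the hashes, and
# store that same concatenation alongside every fragment. Any reader
# can then check any fragment it fetches against the slot for its
# index in the (replicated) checksum.

import hashlib

DIGEST = 32                            # SHA-256 digest size in bytes

def cross_checksum(fragments):
    return b"".join(hashlib.sha256(f).digest() for f in fragments)

def verify(index, fragment, checksum):
    # the fragment must hash to its own slot in the cross checksum
    expected = checksum[index * DIGEST:(index + 1) * DIGEST]
    return hashlib.sha256(fragment).digest() == expected

frags = [b"frag-%d" % i for i in range(5)]   # 5 fragments, as in Fig. 3.1
cc = cross_checksum(frags)
stored = [(f, cc) for f in frags]      # each fragment carries the checksum
print(all(verify(i, f, cc) for i, f in enumerate(frags)))
```

Because every storage-node holds the full list of hashes, a Byzantine node that alters its fragment is caught by any reader that cross-checks it, which is how the mechanism prevents fragment substitution without requiring digital signatures.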
3.2 Mechanisms

This section describes mechanisms employed for encoding data, preventing Byzantine clients and storage-nodes from violating consistency, and authenticating client and storage-node requests. We assume that storage-nodes and clients are computationally bounded such that cryptographic primitives can be effective.
3.2.1 Erasure codes
erasure codes (e.g., Secret Sharing [Shamir 1979] and Short Secret Sharing [Krawczyk 1994]) work as well.