Efficient, scalable consistency for highly fault-tolerant storage. Garth Goodson, CMU-CS-04-111, Dept. of Electrical and Computer Engineering, Carnegie Mellon University, August 2004.
Alternatively, more space-efficient encoding schemes can be used. This section provides an overview of some of the better-known schemes, such as RAID, as well as more general, more space-efficient erasure coding schemes. With these schemes, reads require fragments from multiple servers. Moreover, to decode the data-item, the set of fragments read must correspond to the same write operation; thus, write–write concurrency can be problematic.
In order to increase the performance of disk subsystems, data can be striped across a set of disks. However, as the number of disks in each stripe increases, so does the likelihood of a single disk failing, and with it the probability of data loss. In 1988, Patterson et al. [Patterson et al. 1988] designed Redundant Arrays of Inexpensive Disks (RAID) to overcome these reliability challenges. They solved the problem by storing redundant data, in the form of parity, on one or more of the disks in the array.
There are now a number of different RAID levels. The most common are RAID 0, 1, and 5. Combinations of these levels also exist (e.g., RAID 10). RAID 0 is the simplest, increasing performance by striping data across a set of devices so that they can be read and written in parallel. However, RAID 0 provides no extra redundancy.
RAID 1 mirrors data onto two devices. This scheme can tolerate a single device failure; however, it pays a large cost in storage capacity, since only half of the space is usable for storing data. RAID 5 uses parity with striping to improve space-efficiency. As in RAID 0, data is striped across a set of devices. A parity code is calculated simply by XORing the blocks within each stripe and is stored on a separate device.
In RAID 5 the parity block is rotated among the set of the storage devices.
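To make the parity scheme concrete, the following minimal Python sketch (illustrative only, and not the block layout of any particular RAID implementation) computes a stripe's parity block and reconstructs a lost block from the survivors:

```python
# Illustrative RAID 5-style parity: byte-wise XOR over the blocks of a stripe.
def parity(blocks):
    # XOR all blocks together to form the parity block
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def reconstruct(surviving_blocks, parity_block):
    # XORing the parity block with the surviving blocks recovers the lost one
    return parity(list(surviving_blocks) + [parity_block])

stripe = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]
p = parity(stripe)
assert reconstruct([stripe[0], stripe[2]], p) == stripe[1]
```

Because XOR is its own inverse, the same routine serves for both parity generation and single-failure reconstruction.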
The use of erasure codes can greatly improve space-efficiency relative to replication.
Erasure codes were originally developed for communication channels in the networking community and are sometimes known as forward-error correcting codes. Erasure codes encode a data-item into a set of fragments with the property that any subset of a certain size can be used to reconstruct the original data-item; they can thus tolerate m simultaneous failures with only m extra data-fragments. In contrast, RAID schemes can typically support only m = 1 or m = 2; i.e., they can tolerate only a single or double disk failure.
In this work the focus is on systematic threshold erasure codes in which any m of the n encoded data-fragments can decode the data-item. The first m data-fragments are stripes of the data-item. The remaining n − m code-fragments are generated using polynomial interpolation within a Galois field. As such, each fragment is 1/m the total data size. Thus, the total size-blowup is n/m. Replication can be thought of as a special case of these m-of-n erasure codes, as can the different RAID schemes; m = 1 for replication. Examples of such codes are Reed-Solomon codes [Berlekamp 1968], secret sharing [Shamir 1979], information dispersal (IDA) [Rabin 1989], short secret sharing [Krawczyk 1994], and “tornado” codes [Luby et al. 2001]. The tradeoff in using erasure codes over RAID-like schemes is the performance cost of generating the code fragments.
An example 2-of-5 erasure code scheme is shown in Figure 2.1. The original data-item is striped into 2 fragments, with 3 code fragments being generated. Each fragment is written to a storage-node. Any 2 fragments can be used to decode the original data-item.
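To illustrate how such a threshold scheme works, the following Python sketch implements a 2-of-5 systematic code by polynomial interpolation. For simplicity it works over the prime field GF(257), whereas a practical Reed-Solomon implementation would use GF(2^w) arithmetic; the function names and field choice are assumptions made for this example:

```python
# Illustrative 2-of-5 systematic threshold erasure code over the prime
# field GF(257); real systems use Reed-Solomon codes over GF(2^w).
P = 257

def encode(a, b, n=5):
    # f(x) = a + (b - a)*x, so f(0) = a and f(1) = b are the data stripes
    # (the systematic property); f(2), f(3), f(4) are code fragments.
    return [(a + (b - a) * x) % P for x in range(n)]

def decode(points):
    # Lagrange interpolation: any 2 of the n fragments recover (a, b).
    (x0, y0), (x1, y1) = points
    def f(x):
        l0 = (x - x1) * pow(x0 - x1, -1, P) % P
        l1 = (x - x0) * pow(x1 - x0, -1, P) % P
        return (y0 * l0 + y1 * l1) % P
    return f(0), f(1)

fragments = encode(7, 3)
assert decode([(2, fragments[2]), (4, fragments[4])]) == (7, 3)
```

The first two fragments are the data stripes themselves, and interpolation through any two of the five points recovers the original pair, mirroring the 2-of-5 scheme of Figure 2.1.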
There exists much prior work (e.g., [Agrawal and El Abbadi 1990; Herlihy and Tygar 1987; Mukkamala 1994]) that combines erasure coded data with quorum systems to improve the confidentiality and/or integrity of data along with its availability. However, these systems do not provide consistency (i.e., a synchronization mechanism is required) and do not cope with Byzantine clients. Concurrently with our own work, Frølund et al. [Frølund et al. 2004] have developed a decentralized protocol for linearizable erasure coded read/write registers that utilizes a variant of threshold-quorums.
2.3 Consistency semantics

To provide reasonable semantics, storage systems must guarantee that readers see consistent data-item values.
2.3.1 Linearizability

The linearizability of operations is desirable for block-based read-write storage. Since linearizability is only defined for single-object operations, it is not suitable for describing the multi-object operations that are sometimes required for metadata updates. Linearizability is described by Herlihy and Wing in [Herlihy and Wing 1990]. Operations are linearizable if their return values are consistent with an execution in which each operation is performed instantaneously at a distinct point in time between its invocation and completion. Frølund et al. [Frølund et al. 2004] have recently developed a block-based protocol that provides a variant of linearizability they term strict linearizability [Aguilera and Frolund 2003], in which an operation that crashes either takes effect within some limited time frame or not at all.
The R/W protocol, described in Chapter 3, tolerates Byzantine faults of any number of clients and a limited number of storage-nodes while implementing linearizable [Herlihy and Wing 1990] and wait-free [Herlihy 1991] read-write objects. As well, the R/CW protocol, described in Chapter 4, also implements linearizable objects. In this protocol, linearizability is adapted appropriately for Byzantine clients, and wait-freedom is as in the model of Jayanti et al. [Jayanti et al. 1998]. Since operations performed by Byzantine clients have no clear start time, they are excluded from the set of linearizable operations.
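The definition of linearizability can be made concrete with a brute-force check over a small history: an execution is linearizable if some permutation of its operations respects real-time precedence and obeys register semantics. The operation encoding below is an assumption made purely for illustration; such exhaustive checking is exponential in history length and useful only for building intuition:

```python
# Illustrative brute-force linearizability check for a read/write register.
from itertools import permutations

# Each operation: (kind, value, invoke_time, response_time)
def linearizable(history):
    def legal(seq):
        current = None  # register assumed initially unwritten
        for kind, value, _, _ in seq:
            if kind == "write":
                current = value
            elif value != current:  # a read must return the last written value
                return False
        return True

    def respects_realtime(seq):
        # if op b completed before op a was invoked, b must precede a
        for i, a in enumerate(seq):
            for b in seq[i + 1:]:
                if b[3] < a[2]:  # b responded before a was invoked
                    return False
        return True

    return any(respects_realtime(s) and legal(s)
               for s in permutations(history))

# A read overlapping write(1) may return 1: linearizable.
assert linearizable([("write", 1, 0, 4), ("read", 1, 2, 3)])
# A read returning a value never written is not linearizable.
assert not linearizable([("write", 1, 0, 1), ("read", 2, 2, 3)])
```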
2.3.2 Serializability

The consistency semantic of serializability pertains to the multi-object operations that are required for updating metadata objects atomically. Traditionally, serializability has been defined for transactions within database systems. A sequence of transactions is serializable if its outcome is equivalent to that of some sequential execution of the individual transactions [Papadimitriou 1979]. Strict serializability extends serializability to ensure that transactions already in the history in serial order (i.e., those that have completed) remain in that relative order. This provides a consistency semantic similar to that of linearizability.
A serializable execution satisfies the ACID properties [Haerder and Reuter 1983] (i.e., atomicity, consistency, isolation, and durability). The serializability of transactions can be ensured through a number of techniques. Typical techniques include: two-phase locking [Gray et al. 1976], in which locks are acquired in one phase and released in a separate phase; optimistic concurrency control [Kung and Robinson 1981], in which operations within a transaction are performed optimistically, with no locking, and serializability is validated at commit time; and timestamp ordering [Bernstein et al. 1980], in which timestamps are used to order operations. The query/update protocol described in Chapter 5 implements metadata objects that conform to strict serializability.
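As a concrete illustration of the first of these techniques, the toy Python sketch below enforces the two-phase discipline: once a transaction releases any lock (entering its shrinking phase), it may acquire no more. This is a pedagogical sketch under simplifying assumptions, not a real lock manager; it handles neither deadlock nor shared/exclusive lock modes:

```python
# Illustrative two-phase locking: a growing phase that only acquires,
# followed by a shrinking phase that only releases.
import threading

class Transaction:
    def __init__(self, locks):
        self.locks = locks        # shared table: name -> threading.Lock
        self.held = []
        self.shrinking = False
    def acquire(self, name):
        # 2PL rule: no acquires are permitted after the first release
        assert not self.shrinking, "2PL violation: acquire after release"
        self.locks[name].acquire()
        self.held.append(name)
    def commit(self):
        self.shrinking = True     # enter shrinking phase; release everything
        for name in reversed(self.held):
            self.locks[name].release()
        self.held.clear()

locks = {"x": threading.Lock(), "y": threading.Lock()}
t = Transaction(locks)
t.acquire("x"); t.acquire("y")
t.commit()
assert not locks["x"].locked() and not locks["y"].locked()
```

Because every transaction holds all of its locks at one instant (the end of its growing phase), conflicting transactions are ordered by those instants, which yields a serializable schedule.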
2.3.3 Tolerating benign faults

To provide operation atomicity, concurrency and client failures must be tolerated. A challenge introduced by both is partially completed write operations. Partial writes arise both from write operations still in progress and from write operations that will never complete (e.g., because the client failed).
Common approaches to dealing with partial writes in non-Byzantine-tolerant systems are two-phase commit [Gray 1978] and repair (write-back). Two-phase commit provides failure atomicity (although such protocols may block). Three-phase commit protocols [Skeen and Stonebraker 1983] provide failure atomicity without blocking by utilizing failure detectors and/or recovery mechanisms. Alternatively, many non-Byzantine-tolerant systems (e.g., Harp [Liskov et al. 1991] and Petal [Lee and Thekkath 1996]) serialize their actions through a primary storage-node, which becomes responsible for completing the update.
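A minimal sketch of the two-phase commit message pattern follows, with in-process objects standing in for the participants; the class and method names are hypothetical and the sketch omits logging and recovery entirely:

```python
# Illustrative two-phase commit: phase 1 collects votes, phase 2
# distributes the unanimous decision.
class Participant:
    def __init__(self, will_vote_yes=True):
        self.will_vote_yes = will_vote_yes
        self.state = "init"
    def prepare(self):                 # phase 1: vote request
        self.state = "prepared" if self.will_vote_yes else "aborted"
        return self.will_vote_yes
    def finish(self, commit):          # phase 2: learn the decision
        self.state = "committed" if commit else "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]  # phase 1
    decision = all(votes)                        # commit only if unanimous
    for p in participants:                       # phase 2
        p.finish(decision)
    return decision

ps = [Participant(), Participant(), Participant()]
assert two_phase_commit(ps) and all(p.state == "committed" for p in ps)
```

The blocking hazard mentioned above arises between the two phases: a participant that has voted yes must hold its resources until it learns the coordinator's decision, and a coordinator crash at that point leaves it stuck.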
A common approach to dealing with concurrency is to suppress it, either via leases [Gray and Cheriton 1989] or optimistic concurrency control [Kung and Robinson 1981]. Ensuring operation atomicity in the face of Byzantine failures of clients requires additional work.
An alternate approach to handling both partial writes and concurrency is to have the data stored on storage-nodes be immutable [Reed and Svobodova 1980; Reed 1983]. By deﬁnition, this eliminates the difﬁculties of updates for existing data. In doing so, it shifts the problem up one level; an update now consists of creating a new data-item and modifying the relevant name to refer to it. Decoupling the data-item creation from its visibility simpliﬁes both, but making the metadata service fault-tolerant often brings back the same issues.
For example, SWALLOW [Reed and Svobodova 1980] utilizes immutable object version logs (or histories) ordered by pseudo-time to guarantee strong consistency (i.e., serializability) of arbitrary sets of read/write operations performed on a set of objects. As well, the Amoeba File Server [Mullender 1985] utilizes immutable data versions to implement optimistic concurrency control, such that the file system is always kept in a consistent state. More recently, peer-to-peer systems (e.g., Past [Rowstron and Druschel 2001] and CFS [Dabek et al. 2001]), Farsite, and the archival portion of OceanStore [Kubiatowicz et al. 2000] use immutable versions of data to simplify serialization of access to data.
Other systems, such as Ivy [Muthitacharoen et al. 2002], use immutable version logs containing both data and metadata; however, Ivy does not implement strong consistency guarantees for its metadata (or data) in this fashion.
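The immutable-data-item approach described above can be sketched as follows. Hashing content to name items is one common choice (used by several of the systems cited); the class and method names here are hypothetical:

```python
# Illustrative immutable data-items with a mutable name service.
import hashlib

class ImmutableStore:
    def __init__(self):
        self.items = {}   # content hash -> bytes; entries are never overwritten
        self.names = {}   # name -> hash of the currently visible version
    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.items[key] = data        # create a new item; old versions remain
        return key
    def update(self, name: str, data: bytes):
        # an "update" is: create a new immutable item, then flip the name
        self.names[name] = self.put(data)
    def read(self, name: str) -> bytes:
        return self.items[self.names[name]]

s = ImmutableStore()
s.update("file", b"v1")
old = s.names["file"]
s.update("file", b"v2")
assert s.read("file") == b"v2" and s.items[old] == b"v1"
```

Note how the update is split into the two steps the text describes: creating the new data-item and then making it visible by modifying the name, with all prior versions left intact.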
Frølund et al. [Frølund et al. 2004] recently developed a decentralized consistency protocol for erasure coded data. Their algorithm relies on a quorum construction similar to threshold-quorums that they call “m-quorums” (any two quorums intersect in m processes). They utilize client-generated timestamps to totally order updates and server-side logs to track outstanding requests. Also, as described earlier, their protocol provides a variant of linearizability they call strict linearizability [Aguilera and Frolund 2003]. However, their protocol does allow read and write operations to abort; as such, it forgoes strong liveness guarantees.
2.3.4 Tolerating Byzantine faults
Byzantine fault-tolerant protocols for implementing read-write objects using quorums are described in [Herlihy and Tygar 1987; Malkhi and Reiter 1997; Martin et al. 2002; Pierce 2001]. Of these related quorum systems, only Martin et al. [Martin et al. 2002] achieve linearizability in our fault model, and this work is also closest to ours in that it uses a type of versioning. In our protocol, a reader may retrieve fragments for several versions of the data-item in the course of identifying the return value of a read. Similarly, readers in [Martin et al. 2002] “listen” for updates (versions) from storage-nodes until a complete write is observed. Conceptually, our approach differs in that clients read past versions rather than listen for future versions broadcast by servers. In our fault model, especially in consideration of faulty clients, our protocol has several advantages. First, our protocol works for erasure-coded data, whereas extending [Martin et al. 2002] to erasure-coded data appears nontrivial. Second, ours provides better message efficiency: [Martin et al. 2002] involves a Θ(N²) message exchange among the N servers per write (versus no server-to-server exchange in our case) over and above otherwise comparable (and linear in N) message costs. Third, ours requires less computation, in that [Martin et al. 2002] requires digital signatures by clients, which in practice is two orders of magnitude more costly than the cryptographic transforms we employ. Advantages of [Martin et al. 2002] are that it tolerates a higher fraction of faulty servers than our protocol, and that it does not require servers to store a potentially unbounded number of data-item versions. Our prior analysis of versioning storage, however, suggests that the latter is a non-issue in practice [Strunk et al. 2000], and even under attack this can be managed using the garbage collection mechanism we describe in Section 3.6.
A metadata service, like any deterministic service, can be implemented in a survivable fashion using state machine replication [Schneider 1990], whereby all operations are processed by server replicas in the same order (atomic broadcast). While this approach supports a linearizable, Byzantine fault-tolerant implementation of any deterministic object, such an approach cannot be wait-free [Fischer et al. 1985; Herlihy 1991; Jayanti et al. 1998]. Instead, such systems achieve liveness only under stronger timing assumptions, such as synchrony (e.g., [Cristian et al. 1995; Pittelli and Garcia-Molina 1989; Shrivastava et al. 1992]) or partial synchrony [Dwork et al. 1988] (e.g., [Castro and Liskov 2002; Kihlstrom et al. 2001; Reiter and Birman 1994]), or achieve it only probabilistically (e.g., [Cachin et al. 2001]). An alternative is Byzantine quorum systems [Malkhi and Reiter 1997], from which our protocol inherits techniques (i.e., our protocol can be considered a Byzantine quorum system that uses the threshold quorum construction). Protocols supporting a linearizable implementation of any deterministic object using Byzantine quorums have been developed (e.g., [Malkhi et al. 2001]), but they too necessarily forsake wait-freedom. Additionally, most Byzantine quorum systems utilize digital signatures, which are computationally expensive.
2.3.5 Metadata scalability