Efficient, scalable consistency for highly fault-tolerant storage
Garth Goodson
August 2004
CMU-CS-04-111
Dept. of Electrical and Computer ...
Byzantine storage-nodes can fabricate high timestamps that must be classified as incomplete by read operations. Worse, in each subsequent round of a read operation, Byzantine storage-nodes can fabricate more high timestamps that are just slightly smaller than the previous ones. In this manner, Byzantine storage-nodes can “attack” the performance of the read operation, but not its safety. To protect against such denial-of-service attacks, the read operation can consider all unique timestamps, up to a maximum of b + 1, present in a ResponseSet as candidates before soliciting another ResponseSet. Thus, each “round” of the read operation is guaranteed to consider at least one candidate from a correct storage-node and no more than b candidates from Byzantine storage-nodes.
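The per-round candidate rule can be sketched as follows. This is a minimal Python illustration; the encoding of the ResponseSet as a mapping from storage-node to the timestamps it returned is an assumption of this sketch, not part of the protocol.

```python
def candidate_timestamps(response_set, b):
    """Collect the unique timestamps present in a ResponseSet, highest
    first, and consider at most b + 1 of them as candidates this round.
    With at most b Byzantine storage-nodes, at least one of the b + 1
    candidates must originate at a correct storage-node."""
    unique = sorted({ts for node_ts in response_set.values() for ts in node_ts},
                    reverse=True)
    return unique[: b + 1]
```

For example, with b = 1 and timestamps {9, 7, 3} reported across storage-nodes, only the two highest (9 and 7) are considered before another ResponseSet is solicited, bounding the work a Byzantine storage-node can induce per round.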
3.7.3 Garbage collection
A garbage collection operation that runs concurrently with a read operation could then delete all previous complete writes. If this occurs, the read operation’s next round will observe an incomplete write with no previous history. Effectively, the read operation has “missed” the complete write operation that it would otherwise have classified as such.
When it discovers this fact, the read operation retries (i.e., restarts by requesting a new ResponseSet). Thus, in theory, a read operation faced with perpetual write concurrency and garbage collection may never complete. In practice, such perpetual interaction of garbage collection and read-write concurrency for a given data-item is not realistic.
This chapter has developed an efficient Byzantine-tolerant protocol for reading and writing blocks of data by leveraging the versioning capabilities of storage-nodes. This protocol provides read–write semantics of full data blocks. As such, it is suitable as the basis for the data storage component within a survivable storage system. The subsequent chapters develop protocols that can provide more powerful read–modify–write semantics which are more suitable for constructing metadata services.
The R/W protocol is made space-efficient through the use of erasure codes and made scalable (in terms of faults tolerated) by offloading work from the storage-nodes to the clients. The protocol is work-efficient, since additional overheads only occur in cases of failures or read-write concurrency. Experiments demonstrate that PASIS, a prototype block storage system that uses the R/W protocol, scales well in the number of faults tolerated, supports 60% greater write throughput than BFT, and requires significantly less server computation than BFT.
4 Read/Conditional Write Block Protocol

Unlike data blocks, which support only read and write (R/W) operations, metadata objects (e.g., directories) require update operations that modify their existing contents, rather than blindly overwriting their previous contents, in order to preserve their integrity.
For example, two concurrent insertions into a directory using write operations can result in one being overwritten by the other. To support such operations, this chapter develops a conditional write (CW) operation that performs a write to an object only if the value of the object has not changed since the client last read it. As such, we refer to these objects as read/conditional write (R/CW) objects (vs. R/W). Moreover, read and conditional write operations are linearizable, thus ensuring atomicity for those that succeed.
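The conditional-write semantics described above can be illustrated with a single-node sketch. The names here are hypothetical, and the version counter stands in for the protocol’s logical timestamps; the real protocol distributes this check across storage-nodes.

```python
class CWObject:
    """Single-node illustration of conditional-write semantics: an
    update succeeds only if the object has not changed since the
    client last read it (tracked here with a version number)."""

    def __init__(self, value):
        self.value, self.version = value, 0

    def read(self):
        return self.value, self.version

    def cond_write(self, new_value, expected_version):
        if self.version != expected_version:
            return False          # object changed since the read: abort
        self.value, self.version = new_value, self.version + 1
        return True
```

Two clients that both read version 0 of a directory and then attempt conditional writes cannot overwrite one another: the second `cond_write` aborts, forcing that client to re-read and reapply its insertion, so neither update is silently lost.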
The focus in this chapter is on techniques we have employed in the design of R/CW objects. The protocol is developed in the context of reading and writing full objects, or blocks (as was the R/W protocol). Since storage system data rarely requires the consistency provided by R/CW objects, the next chapter develops a metadata service based on an extension to this protocol. The extension provides a more general operation-based interface that allows for finer-granularity access to objects.
Our R/CW protocols are designed around a hybrid fault model, in which different tolerances for Byzantine and benign (crash-recovery) failures can be specified, so that the cost of the protocol can be tuned to the number of each type of failure anticipated. The R/CW protocol, like the R/W protocol, is extremely optimistic: it is optimized for the common file system workload in which concurrent sharing is low and failures are rare.
This optimism leads to a design for the R/CW protocols that, in the common case, requires just a single round of communication to perform a read operation and one additional round of communication for a conditional write (that can usually be optimized away). If client failures are encountered, or concurrency is observed, more rounds of communication may be necessary. As well, expensive cryptography, notably digital signatures, is avoided.
We describe the system in terms of N storage-nodes and an arbitrary number of clients and objects. Clients perform operations on objects. Storage-nodes host object replicas. Note that this differs from R/W objects, which can use erasure coding for space-efficiency. The impact of using erasure coding is discussed in Section 4.6.1.
The R/CW protocol consists of read operations and CW operations. Both read operations and CW operations issue requests to sets of storage-nodes. CW requests are executed by storage-nodes. As in the R/W protocol, logical timestamps are used to totally order all CW operations and to identify CW requests from the same CW operation across storage-nodes. Each storage-node maintains a replica history, denoted ReplicaHistory, for each object it hosts. The replica history contains the entire set of CW requests executed on the object replica, ordered by logical timestamp (pruning the replica history is discussed in Section 4.3).
4.1.1 R/CW semantics
In this operation, the register X is read, an operation f is performed on X, and the resulting transformation is stored back in X. It has been shown that many other more powerful operations can be implemented as an RMW operation; e.g., test-and-set, fetch-and-add, etc. In a CW operation, an update of X is performed only if the value of X has not changed since it was previously read; thus, the write is conditioned-on the previously read value not having changed.
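The relationship between RMW and conditional updates can be sketched as a retry loop. This is an illustrative shared-memory analogue; the `Register` class and its compare-and-swap method are assumptions of this sketch, not part of the protocol.

```python
import threading


class Register:
    """A register supporting read and compare-and-swap (cas)."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        return self._value

    def cas(self, expected, new):
        """Update the register only if it still holds `expected`."""
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def rmw(register, f):
    """Read-modify-write built from a read and a conditional update:
    retry until the conditional update succeeds, then return the
    value that was read."""
    while True:
        old = register.read()
        if register.cas(old, f(old)):
            return old


def fetch_and_add(register, n):
    return rmw(register, lambda x: x + n)
```

The loop mirrors how a client uses CW operations: read the object, compute the new value, and conditionally write it, retrying if the object changed in between.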
As in the R/W protocol, the protocol is optimistic; thus an operation may be complete, incomplete, or repairable. Every CW operation is preceded by a read operation that identifies the latest complete candidate. A CW operation is conditioned-on the latest complete candidate. R/CW semantics ensure that a single conditioned-on chain exists from any candidate back through all previous complete candidates to the initialized object. Therefore, no two complete writes can condition on candidates such that the links formed between each complete write and its conditioned-on candidate overlap. Figure 4.1 illustrates the conditioned-on chain. R/CW semantics ensure that the conditioned-on time of the current candidate “points” at a complete candidate or at a well-known initial value. As well, since a complete CW operation may be observed as only repairable, all repairable candidates must be repaired to maintain the integrity of the conditioned-on chain.
One of the main challenges of the R/CW protocol is to protect the integrity of the conditioned-on chain, especially from Byzantine clients. Byzantine clients cannot be trusted to follow the protocol; thus, they may attempt to break the conditioned-on chain by conditioning on an incorrect value (e.g., an incomplete CW operation, or not the latest complete CW operation); see Figure 4.1(b). Briefly, to prevent these Byzantine attacks,
Figure 4.1: Examples illustrating the conditioned-on chain.
For each example, the complete object history and each version’s classification for a CW object is shown. As well, the logical timestamp for each version is given, as is the logical time of the version on which it conditions (shown as a subscript to the logical timestamp). Note that the CW operation at logical time 3 is incomplete. Since both the version at logical time 2 and the version at logical time 3 condition on the version at time 1 (e.g., they may have been concurrent), only one version can complete successfully. Example (a) shows a valid conditioned-on chain; the version at logical time 3 is ignored. Example (b) shows an invalid conditioned-on chain: at logical time 4, a (Byzantine) client incorrectly conditions on the version at time 3 even though it is incomplete, thus corrupting the conditioned-on chain.
clients must send “proof” with each CW operation supporting their action. Each client sends an object history set for each CW object being updated. The object history set contains the replica history of each storage-node that replied during the read phase. As well, each of the replica histories is “signed” (digital signatures may be used, but for performance we use authenticators; see Section 4.2.2) and acts as proof that the client is acting correctly. The “signed” object history set can then be validated individually by each storage-node (this validation is discussed in detail in Section 4.3.3).
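An authenticator can be sketched as a vector of HMACs, one per storage-node, computed under pairwise shared keys; each storage-node verifies only its own entry. The key distribution and byte encodings below are assumptions of this sketch; Section 4.2.2 gives the actual construction.

```python
import hashlib
import hmac


def make_authenticator(replica_history_bytes, node_keys):
    """A storage-node "signs" its replica history with an authenticator:
    one HMAC per storage-node, computed under the key it shares with
    that node (node_keys maps storage-node name to shared key)."""
    return {node: hmac.new(key, replica_history_bytes, hashlib.sha256).digest()
            for node, key in node_keys.items()}


def verify_authenticator(replica_history_bytes, authenticator, node, key):
    """Storage-node `node` checks only its own entry of the authenticator."""
    expected = hmac.new(key, replica_history_bytes, hashlib.sha256).digest()
    return hmac.compare_digest(authenticator.get(node, b""), expected)
```

Because a correct storage-node can verify its own HMAC entry, a Byzantine client cannot forge a replica history it never received, yet no public-key signature is needed on the common path.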
4.1.2 Read operation overview
Storage-nodes return the replica history in response to a read request.
The client combines the replica histories returned by the storage-nodes into the object history set (denoted ObjectHistorySet). For example, ObjectHistorySet[S] contains the replica history returned from storage-node S. Classification is performed on the timestamps within the object history set. The purpose of classification is to determine the timestamp of the latest complete (successful) update. A CW operation is complete once a threshold number of benign storage-nodes have executed CW requests. This threshold permits the R/CW protocol to ensure that no subsequent operation can return (or condition-on) a previous object value; as well, it defines classification. Classification identifies a candidate; a candidate is either complete or repairable. If the candidate is complete, then the read operation returns the object value associated with the candidate.
If the candidate is repairable, the client performs a CW operation to repair the candidate.
Once the repairable candidate is complete, its value is returned.
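Classification can be sketched as counting, in the object history set, the storage-nodes that executed the candidate. The thresholds below are placeholders of this sketch; the actual bounds are derived from N and the fault model, and the exact rules appear in Section 4.4.

```python
def classify(object_history_set, candidate_ts, complete_threshold, repair_threshold):
    """Classify a candidate timestamp against an ObjectHistorySet
    (a mapping from storage-node to the set of timestamps it has
    executed). A candidate observed at enough storage-nodes is
    complete; one observed at fewer, but still repairable, must be
    repaired before its value may be returned."""
    count = sum(candidate_ts in history
                for history in object_history_set.values())
    if count >= complete_threshold:
        return "complete"
    if count >= repair_threshold:
        return "repairable"
    return "incomplete"
```

In the example of Figure 4.2, the same candidate can classify differently at two clients because each counts over a different subset of replica histories.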
Aspects of the read operation are illustrated in Figure 4.2. In response to read requests, storage-nodes return object histories to the clients. In the example, each client constructs a different object history set, since each client received responses from a different subset of storage-nodes. Classification is performed on the object history set. In the example, clients A and B classify the candidate with logical timestamp 5 as repairable and complete, respectively. Client B classified the candidate with logical timestamp 6 as incomplete prior to classifying 5 as complete. The exact rules for classification are given in Section 4.4.
4.1.3 CW operation overview
All CW operations are preceded by a read operation that identifies the candidate. Recall that a CW operation is conditioned-on the latest complete candidate (actually, on the object history set for which classification yields the candidate). R/CW semantics ensure that a single conditioned-on chain exists from any candidate back through all previous complete candidates to the initialized object. Each entry in a replica history is a ⟨logical timestamp, conditioned-on logical timestamp, value⟩ tuple. The elements of this tuple are denoted ⟨LT, LT_conditioned, Data⟩, and replica histories are initialized to ⟨0, 0, ⊥⟩.
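A replica-history entry and its initialization can be sketched as follows. Logical timestamps are simplified to integers here; their full three-member structure is given in Section 4.2.1, and the field names are illustrative.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    """One replica-history entry: the CW request's logical timestamp,
    the timestamp it conditions on, and the written value (None
    stands in for the initial value, written ⊥ in the text)."""
    lt: int
    lt_conditioned: int
    data: Optional[bytes]


def new_replica_history():
    """A replica history is the ordered list of executed CW requests,
    initialized to the well-known ⟨0, 0, ⊥⟩ entry."""
    return [Entry(lt=0, lt_conditioned=0, data=None)]
```

Each later entry’s `lt_conditioned` names an earlier entry’s `lt`, which is exactly the conditioned-on chain of Figure 4.1.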
The largest timestamp in the object history set is used by the client to create a timestamp for the CW operation. As discussed later, hashes of the object history set and the object value are also placed in the timestamp (these hashes ensure that all CW operations have a unique timestamp and protect against Byzantine entities). The client sends CW requests to all storage-nodes. The CW request contains the timestamp of the CW operation, the object history set constructed by the preceding read operation, the candidate (found from classification of the object history set), and the object value.
Correct storage-nodes execute a CW request only if the timestamp, value, and replica history can all be validated. Validation ensures failure atomicity, concurrency atomicity, and, as discussed below, protects against Byzantine entities. If sufficient storage-nodes execute CW requests, the CW operation completes; otherwise, it aborts. CW operations may abort due to concurrency. Since repair is a CW operation, and may be part of a read operation, read operations may also abort.
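A subset of this validation can be sketched as follows. The byte encodings and hash choice are assumptions of this sketch, and only the timestamp-binding and ordering checks are shown; Section 4.3.3 gives the complete rules.

```python
import hashlib


def validate_cw_request(ts, object_history_set_bytes, value_bytes,
                        latest_executed_ts):
    """Checks a correct storage-node might make before executing a CW
    request. `ts` is a (Time, Verifier_OHS, Verifier_Data) tuple."""
    time, verifier_ohs, verifier_data = ts
    # The timestamp must bind the object history set the client sent...
    if verifier_ohs != hashlib.sha256(object_history_set_bytes).digest():
        return False
    # ...and the value being written.
    if verifier_data != hashlib.sha256(value_bytes).digest():
        return False
    # The request must be ordered after everything already executed
    # at this replica (lexicographic tuple comparison: Time first).
    return ts > latest_executed_ts
```

Because the verifiers are recomputed locally, a Byzantine client cannot pair a legitimate timestamp with a substituted object history set or value.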
To ensure that only one CW operation conditioned upon a given CW operation completes, a barrier may be written to ensure that outstanding concurrent CW operations cannot complete. Barriers mark a point in time without inserting a value into the system.
Thus, a completed barrier written in front of an incomplete value prevents (or bars) the incomplete operation from ever completing; storage-node validation will fail, since the barrier’s timestamp is larger than the timestamp being conditioned on by the incomplete operation. Barriers allow CW operations to be issued that may complete in the face of concurrency or client failure.
4.2 Mechanisms

This section describes various mechanisms employed to guarantee safety within the R/CW protocol.
4.2.1 Validating timestamps

Logical timestamps are structured values with three members: the primary timestamp (Time), the object history set verifier (Verifier_OHS), and the value verifier (Verifier_Data).
The verifiers are collision-resistant hashes over the object history set and the object’s value, respectively. In comparing two logical timestamps to determine which is greater, the primary timestamps are first compared, and then the verifiers are compared. Since verifiers are guaranteed to be unique (for unique object values), all timestamps are guaranteed to be unique.
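Representing a logical timestamp as a tuple makes the comparison rule direct: the primary timestamp is compared first, then the verifiers. The hash function and byte encodings here are illustrative.

```python
import hashlib


def logical_timestamp(time, ohs_bytes, value_bytes):
    """Build a (Time, Verifier_OHS, Verifier_Data) tuple. Python's
    lexicographic tuple ordering compares Time first and then the
    verifiers, matching the comparison rule above."""
    return (time,
            hashlib.sha256(ohs_bytes).digest(),
            hashlib.sha256(value_bytes).digest())
```

Two CW operations at the same primary time but with different values yield different verifiers, and therefore distinct, totally ordered timestamps.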