Efficient, scalable consistency for highly fault-tolerant storage. Garth Goodson. August 2004. CMU-CS-04-111. Dept. of Electrical and Computer Engineering, Carnegie Mellon University.
There must be sufficiently many good storage-nodes in the system for a CW operation by a correct client to complete. A client must terminate after it receives N − t responses. As well, up to b responses may be from Byzantine storage-nodes (which lie about executing the operation), similar to the R/W protocol (see Section 3.4.4).
However, there is another action a Byzantine storage-node can take that was not possible in the R/W protocol. A Byzantine storage-node can falsely claim that it could not execute the CW request because validation failed (e.g., that it hosts a CW request with a timestamp greater than LT; see Section 4.3.3). Tolerating this action requires that a CW operation be able to complete in the face of b storage-nodes that reject the operation. If a reject message always reaches the client before some accept message from a benign storage-node, the client will always abort the CW operation (since in an asynchronous crash-recovery model it cannot await all responses). If Byzantine storage-nodes do not control the network, then, probabilistically, the CW operation will eventually complete successfully.
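The client's decision process described above can be sketched as follows. This is a minimal illustration with hypothetical names and thresholds (the value QC = 4 for N = 6, t = b = 1 is taken from this chapter's example configurations), not the thesis's actual implementation:

```python
# Sketch: a client collects exactly N - t responses to a CW request.
# Up to b of them may be false "reject" responses from Byzantine
# storage-nodes that lie about a failed validation.

def cw_outcome(responses, N, t, b, qc):
    """Classify a CW operation from its first N - t responses.

    Each response is 'accept' or 'reject'; a Byzantine storage-node
    may send 'reject' even though validation would have passed.
    """
    assert len(responses) == N - t, "client awaits exactly N - t responses"
    accepts = sum(1 for r in responses if r == "accept")
    rejects = len(responses) - accepts
    if accepts >= qc:
        return "complete"   # enough benign executions are assured
    if rejects > b:
        return "abort"      # more rejects than Byzantine nodes: a benign node rejected
    return "retry"          # the rejects may all be lies; retry the operation

# t = b = 1 configuration (N = 6, QC = 4): the client awaits 5 responses.
print(cw_outcome(["accept"] * 4 + ["reject"], N=6, t=1, b=1, qc=4))      # complete
print(cw_outcome(["accept"] * 3 + ["reject"] * 2, N=6, t=1, b=1, qc=4))  # abort
```

The "retry" branch corresponds to the probabilistic completion argument above: if the b possible liars delayed benign accepts, a later attempt will likely gather them.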
Thus, for the CW operation to be guaranteed to complete (i.e., for a client to ensure that it is possible for QC benign storage-nodes to execute CW requests during the CW operation), the constraints on N and QC must be strengthened so that up to b false rejections can be masked.
4.4.7 Improving the bounds on N and QC
If we relax the liveness guarantee provided in Section 4.4.5, the bounds on N and QC can be significantly improved. Recall that the bounds presented above arise from guaranteeing that CW operations always terminate despite the false rejection of operations by Byzantine storage-nodes. By relaxing this guarantee, the constraint on QC becomes:

QC ≥ 2t + b + 1.
The differences between the two sets of constraints translate to slightly different liveness properties for write operations that incur no write concurrency in the face of Byzantine storage-node faults. In both cases, however, safety is never compromised.
The first derivation (4.6) guarantees that, in the absence of write concurrency, writes always complete. The second derivation (4.9) provides slightly weaker liveness guarantees: in the absence of write concurrency but in the presence of Byzantine storage-node faults, writes may complete; whether they do depends on whether responses from Byzantine storage-nodes are among the N − t responses collected by the client.
These weaker guarantees may be acceptable, since the power of a Byzantine storage-node can be mitigated through some synchrony assumptions (waiting for more responses, within some time bound) or through the assumption of a fair network (in which Byzantine entities cannot control message ordering). Under either assumption, responses from a set of benign storage-nodes will likely eventually be collected, and the operation will therefore complete. A more thorough discussion of the malicious rejection of CW operations is given in Section 4.6.3.
Read operations only return the value of the latest complete CW operation. They are linearized after the CW operation whose value they return.
CW operations complete only if they are conditioned on the latest complete CW operation. A complete CW operation is always observed as repairable or complete; if it is repairable, its value is written “forward” to a new timestamp, preserving the conditioned-on version chain.
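The classification of an observed candidate version can be sketched with the COMPLETE and INCOMPLETE thresholds used in this chapter's examples (the function name and the exact counting of replica histories are illustrative assumptions):

```python
# Sketch: a candidate version is classified by how many of the collected
# replica histories contain its timestamp. Counts at or above COMPLETE are
# complete; counts between INCOMPLETE and COMPLETE are repairable and must
# be written "forward" before use; lower counts can be treated as not written.

def classify(count, complete, incomplete):
    if count >= complete:
        return "complete"
    if count >= incomplete:
        return "repairable"
    return "incomplete"

# t = b = 1 configuration from Section 4.5: COMPLETE = 5, INCOMPLETE = 3.
assert classify(5, complete=5, incomplete=3) == "complete"
assert classify(4, complete=5, incomplete=3) == "repairable"
assert classify(2, complete=5, incomplete=3) == "incomplete"
```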
4.5 Protocol scalability
Consider a failure model with t = b = 1. The smallest configuration for this failure model is N = 6, COMPLETE = 5, and INCOMPLETE = 3. Larger configurations, which reduce the load on any given storage-node, are possible. For example, given the same failure model, another valid configuration is N = 9, COMPLETE = 7, and INCOMPLETE = 5. In the smallest configuration, each storage-node must execute requests for a larger fraction of the operations performed; in the larger configuration, each storage-node need only execute requests for a smaller fraction of the operations performed. Thus, it is possible to add storage-nodes to the system to increase its throughput. If 3Δ storage-nodes are added to the system to improve the
throughput, then the R/CW constraints (from Section 4.4.7), for Δ ≥ 0, become:

N ≥ 3t + 2b + 1 + 3Δ, QC ≥ 2t + b + 1 + 2Δ, COMPLETE = 2t + 2b + 1 + 2Δ, INCOMPLETE = t + b + 1 + 2Δ.
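The numeric configurations above can be checked against the relations that appear elsewhere in this chapter (N = 3t + 2b + 1 and QC = 2t + b + 1 from the example in Section 4.6.3, and COMPLETE − t − b = INCOMPLETE). The helper below is only an arithmetic consistency check, not part of the protocol:

```python
# Consistency check of the relaxed bounds against the chapter's examples.

def relaxed_bounds(t, b):
    N = 3 * t + 2 * b + 1
    qc = 2 * t + b + 1
    complete = 2 * t + 2 * b + 1
    incomplete = t + b + 1
    return N, qc, complete, incomplete

N, qc, complete, incomplete = relaxed_bounds(1, 1)
assert (N, complete, incomplete) == (6, 5, 3)    # smallest t = b = 1 configuration
assert complete - 1 - 1 == incomplete            # COMPLETE - t - b = INCOMPLETE

# Adding 3 storage-nodes (Delta = 1) yields the larger configuration cited above:
assert (N + 3, complete + 2, incomplete + 2) == (9, 7, 5)
```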
Increasing Δ thus decreases the fraction of operations for which each storage-node must execute requests.
If other quorum construction techniques are employed (e.g., the M-Path construction [Malkhi et al. 2000]), then the lower bound on load is Ω(√(b/N)). At the scale of the prototype metadata service, threshold quorums suffice to demonstrate the benefits of quorum techniques; the benefits of true quorum constructions, such as the M-Path construction, only become prominent once systems are very large.
4.6 Discussion

4.6.1 Erasure codes
In the R/W protocol (which utilizes erasure coding), the client is responsible for the detection of poisonous writes. As described in Section 3.2.3, the client detects poisonous writes by regenerating all N data-fragments, from which the cross checksum is recomputed and verified. For this approach to work in the R/CW protocol, the client would have to perform this validation on each value contained within the conditioned-on chain, all the way back to the initial value (or to an agreed-upon “correct” value, e.g., one decided upon through garbage collection); this is impractical. One possibility is to use some type of verifiable sharing scheme (e.g., Verifiable Secret Sharing [Chor et al. 1985; Feldman 1987]), in which storage-nodes are able to validate the integrity of each CW operation; however, such schemes are currently very computationally and space inefficient.
The other limitation of using erasure-coded data is that it requires the client to compute (and erasure-code) the update before transmitting it to the storage-nodes. As will be discussed in Chapter 5, the R/CW protocol is extended to support arbitrary operations that are executed solely on the storage-node; e.g., if a CW object implements a directory, the client need only transmit the name it wishes to insert, and the storage-node does the work of inserting the name into the directory. This approach implements replicated state machines; as such, it is not amenable to erasure coding.
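The directory example can be illustrated with a small sketch. The class and operation names here are hypothetical, not the metadata service's actual interface; the point is that each storage-node replica applies the operation itself, so the client ships only the operation, not an encoded copy of the whole updated object:

```python
# Sketch of storage-node-side execution (replicated state machine style):
# the same deterministic apply() runs at every replica, so replicas that
# execute the same operations in the same order hold identical state.

class DirectoryObject:
    def __init__(self):
        self.entries = {}

    def apply(self, op, name, target=None):
        # Executed at each storage-node replica.
        if op == "insert":
            self.entries[name] = target
        elif op == "remove":
            self.entries.pop(name, None)
        return sorted(self.entries)

replica = DirectoryObject()
print(replica.apply("insert", "foo", "objid-42"))  # ['foo']
print(replica.apply("insert", "bar", "objid-43"))  # ['bar', 'foo']
print(replica.apply("remove", "foo"))              # ['bar']
```

Because each replica must execute the full operation on its own copy of the state, the state cannot be split into erasure-coded fragments, which is exactly the incompatibility noted above.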
Regardless of these limitations, erasure coding within the R/CW protocol may still be useful, depending on the application's requirements. If the system model precludes Byzantine clients from performing poisonous writes and block storage is required, the R/CW protocol provides stronger semantics than does the R/W protocol.
4.6.2 Invalid authenticators
If the authenticator does not pass validation at a storage-node, the storage-node cannot tell whether a Byzantine storage-node is involved or whether a Byzantine client corrupted a correct authenticator (or object history set); digital signatures do not have this problem. As such, the storage-node can “reject” the CW request and place the onus on the client to retry the R/CW operation using digital signatures. Other options include allowing the storage-node to reject the CW request outright (which gives a Byzantine storage-node the power to force a CW operation to abort) or allowing storage-nodes to perform a read operation to directly validate the object history set (which could require O(N²) messages in limited situations).
4.6.3 Storage-nodes rejecting CW requests
Byzantine storage-nodes may arbitrarily reject CW requests (claiming that object history set validation failed). As discussed in Section 4.4.8, there is a tradeoff between the liveness of the protocol and the constraints on N and QC. With reduced liveness, Byzantine storage-nodes may be able to force clients to abort.
In addition to the “solutions” described previously (also in Section 4.4.8), i.e., increasing the constraints on N and assuming a fair network, a third solution exists: require storage-nodes to provide sufficient evidence, in the form of a valid object history set that supersedes the rejected CW request.
To provide this evidence, the storage-node must be modified in a number of ways. First, the replica history returned by a storage-node in response to a read request must include the client ID of the client that issued the read request; the storage-node generates the authenticator over the ⟨ClientID, ReplicaHistory⟩ tuple. Second, upon successfully executing a CW request, the storage-node must retain the object history set on which the CW is based. Third, in response to a rejected CW request, the storage-node replies with the set of “signed” ⟨ClientID, ReplicaHistory⟩ tuples as evidence that the CW request is being rejected correctly.
Unfortunately, authenticators do not allow the client to directly validate the returned replica histories; signatures do not have this problem. If using authenticators, the client must validate a returned replica history by sending the ⟨ClientID, ReplicaHistory⟩ tuple back to the storage-node that originally authenticated it. One additional modification must then be made to storage-nodes: they must track the client IDs of all read operations (these logs can be purged in a manner similar to version pruning; this modification is not needed if digital signatures are used). This modification is required for the storage-node to validate that the client ID present in the authenticated tuple is indeed correct. The use of the client ID ensures that Byzantine storage-nodes cannot “reuse” replica histories sent to them by correct clients to arbitrarily reject requests (i.e., a client must actually have performed the read that resulted in the replica history).
This prevents Byzantine storage-nodes from generating arbitrary, verifiable replica histories. Once enough replica histories have been validated, the client can determine whether or not the request was rejected appropriately.
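The evidence check can be sketched as follows. The data structures and the single-key HMAC “signature” are simplifying assumptions for illustration; the actual protocol uses authenticators (vectors of MACs), and validation is performed by sending each tuple back to the storage-node that authenticated it:

```python
# Sketch: validating rejection evidence consisting of authenticated
# (ClientID, ReplicaHistory) tuples. Per-node keys stand in for asking
# each storage-node to re-validate its own authenticator.
import hmac, hashlib

def sign(key, client_id, replica_history):
    msg = repr((client_id, replica_history)).encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()

def validate_evidence(evidence, keys, known_reads, needed):
    """Accept a rejection only if enough evidence tuples validate.

    evidence:    list of (node_id, client_id, replica_history, tag)
    keys:        per-node keys used to re-check each tag
    known_reads: (node_id, client_id) pairs for reads that actually occurred,
                 preventing reuse of histories issued to other clients
    """
    valid = 0
    for node_id, client_id, history, tag in evidence:
        if (node_id, client_id) not in known_reads:
            continue  # that client never performed this read: reject the tuple
        if hmac.compare_digest(tag, sign(keys[node_id], client_id, history)):
            valid += 1
    return valid >= needed
```

A Byzantine storage-node that fabricates a tuple fails the MAC check, and one that replays a tuple issued to a different client fails the known-reads check, matching the argument above.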
For example, with N = 3t + 2b + 1 and QC = 2t + b + 1, COMPLETE = 2t + 2b + 1 and INCOMPLETE = t + b + 1. For a storage-node to correctly accept a CW request at time T, it must have observed an object history set with at least COMPLETE logical timestamps at time T. Of those replica histories, validations from COMPLETE − t storage-nodes can be awaited. Of the validations that occur, at most b may fail due to Byzantine storage-nodes, so COMPLETE − t − b are guaranteed to pass validation. Since COMPLETE − t − b = t + b + 1 = INCOMPLETE, Byzantine storage-nodes can make complete operations appear to be repairable. Rejecting a CW operation on the basis of hosting a repairable CW operation at a later timestamp is a valid action.
4.7 Evaluation
Since the R/CW protocol provides consistency more suited to a metadata service than to a block store, the majority of the evaluation is presented in the next chapter (where the R/CW protocol is extended for use in a metadata service). However, the R/CW protocol is not precluded from being used as a block-based storage protocol. Thus, a brief evaluation of the protocol's response time, when providing consistency for the storage of erasure-coded data, is presented below. Results pertaining to system throughput, concurrency, and scalability are presented in Chapter 5.
The experimental setup is as follows. A rack of Intel P4 2.66 GHz machines with 1 GB of memory was used for the experiments. Each storage-node utilizes a dedicated 33.6 GB Seagate Cheetah 10K RPM SCSI disk. All nodes are connected through a single gigabit Ethernet switch.
Figure 4.11 shows the mean response time of read and write operations as the number of faults tolerated is increased. Response time grows more quickly when b = t than when b is fixed at 1. This is due to the rapid growth of N at b = t. As N increases, so does client computation time (to perform the encoding), as do communication costs. An increase in N also reduces network efficiency, since smaller packets are being transmitted to a larger number of storage-nodes; recall that each data-fragment is 1/m the size of the original block, so as N increases, so does m, and thus data-fragment size is reduced. The space-efficiency of both lines is almost the same;
for example, with b = t = 4 (N = 21, m = 9) the total blowup is 37.3 KB, while with b = 1, t = 4 (N = 12, m = 6) the total blowup is 32 KB.
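The quoted blowup figures can be reproduced with a one-line calculation, assuming each of the N data-fragments is 1/m the size of a 16 KB block (the 16 KB block size is an assumption here; it is what makes both quoted figures come out exactly):

```python
# Total stored size for an m-of-N erasure-coded block:
# N fragments, each 1/m the block size, so blowup = (N / m) * block size.

def blowup_kb(N, m, block_kb=16):
    return N * block_kb / m

print(round(blowup_kb(21, 9), 1))  # b = t = 4 configuration: 37.3 KB
print(blowup_kb(12, 6))            # b = 1, t = 4 configuration: 32.0 KB
```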
This chapter has developed a novel protocol that provides linearizability of R/CW operations. A conditional write operation writes to an object only if the value of the object has not changed since the object was last read. The R/CW protocol is useful for providing consistency of updates to metadata objects, although it can also be used to provide consistency of block updates. The R/CW protocol provides read–modify–write semantics.
It has been shown that many more powerful operations can be built with RMW semantics than with RW semantics (e.g., test-and-set). These operations are crucial to building fault-tolerant metadata services.
The R/CW protocol shares many features with the R/W protocol. It is designed around a hybrid fault model; it is extremely optimistic, optimized for low concurrency; and it is enabled by storage-node versioning. However, in order to fully tolerate Byzantine clients, replication must be employed. As well, the constraints on N and QC are higher. This chapter also showed that the R/CW protocol scales well as the number of faults tolerated is increased and when using erasure coded data. The next chapter extends the R/CW protocol into a query/update protocol and shows how a scalable metadata service and storage-system can be built.
5 Metadata Service

Scalability is a primary focus of many networked storage systems, including NASD [Gibson et al. 1998], Lustre [Braam 2004], and many recent SAN file system products. These systems all share a common design: a distinct metadata service managing a scalable collection of data storage servers. A similar high-level architecture is shared by recent research systems like Farsite [Adya et al. 2002] and Pond [Rhea et al. 2003], which logically separate metadata management from data storage.