«Efﬁcient, scalable consistency for highly fault-tolerant storage GARTH GOODSON August 2004 CMU-CS-04-111 Dept. of Electrical and Computer ...»
ﬁcation over the new object history set, line 416. This is necessary, since barrier-writes may be executed at storage-nodes hosting the CW operation that they are trying to block (i.e., an incomplete operation may complete just prior to the execution of the barrier). If classiﬁcation yields the same candidate (cf. line 417), the client can perform the valuewrite. And, if the value-write completes, then the Candidate is returned (in case the CW operation is called for repair). If either the barrier-write or value-write do not complete, then the CW operation aborts, and must be retried (including the read phase). Figure 4.6 shows an example of a barrier-write that necessitates re-classiﬁcation.
4.3.3 CW requests at storage-nodes Receiving CW requests
RECEIVE WRITE REQUEST(LT, LT conditioned, ObjectHistorySet, Data) :
600: if (VALIDATE CW REQ(LT, LT conditioned, ObjectHistorySet, Data)) then 601: /∗ Execute the CW request ∗/ 602: Request := LT, LT conditioned, Data 603: ReplicaHistory := ReplicaHistory ∪ Request 604: /∗ Prune history ∗/ 605: if (Data = ⊥) then 606: PRUNE HISTORY SET(ReplicaHistory, LT conditioned ) 607: end if 608: SEND(WRITE RESPONSE, S, ReplicaHistory, SUCCESS) 609: return 610: end if 611: SEND(WRITE RESPONSE, S, ReplicaHistory, FAIL) 612: return
Figure 4.7: Reception of a CW at storage-node S.
into the storage-node’s local history, line 603. The request is comprised of the tuple:
LT, LT conditioned, Data.
If an object history set contains a complete candidate, the storage-node can prune its replica history up to the complete candidate’s timestamp (cf. line 606). A validated object history set always contains a candidate that is complete (although such a candidate may be earlier in the object history set than a repairable candidate returned from classiﬁcation); note, the initial timestamp 0 can be considered the earliest complete candidate. All versions prior to the latest complete (non-barrier) CW (i.e., prior to the CW’s LT conditioned ) can be pruned; however, no pruning occurs on barrier-writes. The use of authenticators obviates the need for distributed garbage collection as was necessary in the RW protocol.
Finally, a response is sent back to the client. Note that the storage-node’s local history is always transmitted in the response. This allows the client to update the object history set and re-perform classiﬁcation if necessary (e.g., if the CW operation failed).
Validating CW requests
Storage-nodes must validate CW requests. If validation succeeds, the storage-node executes the CW request: it executes the value-write and updates its replica history to include the CW request. Remember, the replica history is stored in stable storage that endures the crash-recovery cycle. If validation fails, the storage-node rejects the CW request.
With the object history set, the storage-node can perform the exact same logic as the · 72 Efﬁcient, scalable consistency for highly fault-tolerant storage
VALIDATE CW REQ(LT, LT conditioned, ObjectHistorySet, Data) :
700: /∗ Validate LT.Time, ensures logical time is always increasing ∗/ 701: if (LT.Time = ((MAX TIMESTAMP[ObjectHistorySet]).Time + 1)) then 702: return (FAIL) 703: end if 704: /∗ Validate authenticators for each history set ∗/ 705: if (VALIDATE AUTHENTICATORS(ObjectHistorySet) = FAIL) then 706: return (FAIL) 707: end if 708: /∗ Validate veriﬁers ∗/ 709: if (HASH(Data) = LT.Veriﬁer Data) then 710: return (FAIL) 711: else if (HASH(ObjectHistorySet) = LT.Veriﬁer OHS) then 712: return (FAIL) 713: end if 714:
715: /∗ Perform classiﬁcation to ﬁnd latest logical timestamps ∗/ 716: LTlatest comp nb, LTlatest comp nb bd, LTlatest nb, LTlatest barrier, LTlatest ni nb, Classifylatest barrier := CLASSIFY(ObjectHistorySet) 717:
718: /∗ Check if a barrier is needed, and not writing a barrier ∗/ 719: if ((LTlatest comp nb bd = LTlatest nb ) AND ((LTlatest barrier LTlatest nb ) OR (Classifylatest barrier = COMPLETE)) then 720: /∗ Barrier is needed, make sure this is a barrier ∗/ 721: if (LT.Barrier = FALSE) then 722: return (FAIL) 723: end if 724: else if (LT.Barrier = TRUE) then 725: return (FAIL) 726: end if 727:
728: /∗ Validate replica acceptance policy ∗/ 729: if (MAX[ReplicaHistory] MAX[LTlatest nb, LTlatest barrier ]) then 730: return (FAIL) 731: end if 732:
733: if (LT.Barrier = FALSE) then 734: /∗ Validate condition on relationship, conditioning on a complete∗/ 735: if ((LTlatest nb = LTlatest comp nb ) AND (LT conditioned = LTlatest comp nb )) then 736: /∗ Classiﬁed complete as latest timestamp, but LT conditioned not conditioned on latest complete ∗/ 737: return (FAIL) 738: end if 739:
740: /∗ Validate condition on relationship, performing repair ∗/ 741: if (((LTlatest nb = LTlatest comp nb ) AND (LTlatest nb = LTlatest ni nb )) AND (LT.Veriﬁer Data = LTlatest ni nb.Veriﬁer Data) OR (LT conditioned = LTlatest comp nb bd )) then 742: return (FAIL) 743: end if 744: end if 745: return (SUCCESS)
Logical timestamps returned from classifying the object history set on the storage-node:
• LTlatest barrier : Latest barrier logical timestamp;
• LTlatest nb : Latest non-barrier logical timestamp;
• LTlatest ni nb : Latest non-incomplete, non-barrier logical timestamp;
• LTlatest comp nb : Latest complete, non-barrier logical timestamp;
• LTlatest comp nb bd : Latest complete, non-barrier, by deduction logical timestamp.
Note: LTlatest comp nb bd = MAX[LTlatest comp nb, LTlatest ni nbconditioned ]
Figure 4.9: Logical timestamps returned from classiﬁcation.
line 701. Next, the authenticators for the object history set is validated (cf.
line 705). Recall, the object history set is comprised of history sets from each of the storage-nodes queried during the read operation phase. Each of these history sets has a corresponding authenticator that must be validated. The impact of failed authenticator validation is discussed in Section 4.6.2.
Next, the veriﬁers, stored within the timestamp, are validated (cf. lines 708 - 713).
Again, the storage-node is replicating client logic to ensure that the validating timestamp is well-formed. The object history set veriﬁer determines the conditioned-on relationship for a the CW operation and the value veriﬁer determines the value of the CW operation, thus limiting the actions a Byzantine client can perform.
By performing classiﬁcation on the object history set, line 716, the storage-node can validate that the candidate is the correct candidate, and that the timestamp is the correct timestamp (since both are deterministic given an object history set). As well, the storagenode can check to see if a barrier is required (cf. line 719). To make these checks, storagenode classiﬁcation returns multiple logical timestamps, each set to a logical timestamp within the object history set. These timestamps are described within Figure 4.9.
The descriptions of most of the timestamps returned from classiﬁcation are clear.
However, the LTlatest comp nb bd timestamp deserves further discussion. As described, this timestamp represents a version that is complete by deduction. A candidate that is complete by deduction is a candidate that is necessarily complete, but has not been observed as such. This occurs when a repairable (non-barrier) candidate A is observed (in the object history set) to have conditioned upon another, earlier, repairable (non-barrier) candidate B.
Recall, a complete candidate may be viewed as repairable. Since A is repairable, it has · 74 Efﬁcient, scalable consistency for highly fault-tolerant storage passed validation at ≥ QC − t storage-nodes, of which at least QC − t − b are benign (see the constraint derivations for more details, Section 4.4). The only possible way that A can pass validation at a benign storage-node is by conditioning on a complete (non-barrier) candidate. Thus, B must be complete (by deduction).
Barriers are required to squash pending CW requests (e.g., incomplete CW operations from failed clients). A barrier is required if there exists any non-barrier, non-complete CW request that exists within the object history set at a later logical timestamp than any other complete CW write or barrier operation (i.e., there exists a pending incomplete CW operation). If the Barrier portion of the logical timestamp is set to TRUE, the storagenode knows that the CW request is part of a barrier-write. The only further validation that occurs for barrier-writes is that of the acceptance-check against the replica’s history.
The acceptance-check, line 728 ensures that a CW request has not been executed locally that is later than the latest logical timestamp present within the object history set.
If the CW is not a barrier-write, the conditioned-on relationship is then validated (cf. lines 734 - 743). There are two cases for validation. First, a complete candidate is classiﬁed as the latest non-barrier. In this case, the conditioned-on timestamp is veriﬁed against the latest complete, non-barrier CW (cf. line 735).
Second, a repairable candidate is classiﬁed as the latest non-barrier. That is, the latest non-barrier is the same as the latest non-incomplete, non-barrier operation and the latest non-barrier is not the latest complete, line 741. If the candidate is repairable, two validations are performed. First, the storage-node validates that the object value for the CW request is the same as the repairable candidates (this ensures that the repair occurs correctly). Second, the CW’s conditioning on timestamp is checked against the repairable’s conditioning on timestamp (note, in this case LTlatest comp nb bd = LTlatest ni nbconditioned );
this ensures the continuity of the conditioning chain.
A summary of the conditions that hold if storage-node validation succeeds is presented in Figure 4.10. These are derived from the pseudo-code shown in Figure 4.8.
4.4 Constraints 75
4.4 Constraints This section presents bounds on N and QC, the deﬁnition of a complete CW operation (in terms of t and b), and constraints on COMPLETE and INCOMPLETE. The derivation of constraints is similar to that for the protocol for R/W objects in Section 3.4. The differences in bounds arise mainly from the added constraint that repairable candidates must intersect with complete candidates for the R/CW protocol to be safe. A formal proof of the safety and liveness of the R/CW protocol and its extension to the Q/U protocol is given in [Abd-El-Malek et al. 2004].
4.4.1 Read classiﬁcation rules
Recall that the candidate is the data-item version, returned by a read request, with the greatest logical timestamp. The set of read responses that share the candidate’s timestamp are the candidate set. Constraints on COMPLETE and INCOMPLETE are required to ensure two properties. The ﬁrst property is that if a candidate is ever classiﬁed as complete, then any subsequent read operation observes the complete candidate as repairable. The second property is that if candidate A is complete and conditioned-on candidate B, any repairable candidate with a timestamp greater than A either conditioned-on A or can traverse the conditioned-on chain back to A.
To classify a candidate as complete, a candidate set of at least QC benign storagenodes must be observed. In the worst case, at most b members of the candidate set may be Byzantine, thus, |CandidateSet| − b ≥ QC ⇒ COMPLETE = QC + b. (4.1) · 76 Efﬁcient, scalable consistency for highly fault-tolerant storage To classify a candidate as incomplete, the candidate must be incomplete (i.e., fewer than QC benign storage-nodes have executed the CW). We consider a rule for classifying incomplete candidates that takes advantage of N − t responses from storage-nodes. In the crash-recovery model, eventually, a client is guaranteed to receive this many responses— even though, there may be periods during which more than t storage-nodes are crashed.
Moreover, a client cannot expect more than this many responses, since up to t storagenodes may never recover (and in an asynchronous environment crash failures are undetectable). Thus, the rule for classifying a candidate incomplete is, |CandidateSet| + t QC ⇒ INCOMPLETE = QC − t. (4.2) Candidates that cannot be classiﬁed as complete or incomplete are classiﬁed as repairable.
Given these constraints on COMPLETE and INCOMPLETE, consider the examples illustrated in Figure 4.2. In this example, t = 1 and b = 0, so COMPLETE = 3, INCOMPLETE = 2, and N = 4. Since client A observes a candidate with timestamp 5 in two replica histories, and this is not less than the incomplete threshold, it classiﬁes 5 as repairable. Whereas candidate 6 is observed by client B in one replica history and classiﬁed incomplete.
4.4.2 Real repairable candidates
This property ensures that colluding Byzantine storage-nodes are unable to fabricate a candidate that a correct client deems repairable. To achieve this property, a candidate set of size b must be classiﬁable as incomplete. Substituting |CandidateSet| = b into (4.2),
didate may be observable as repairable, a client may observe two repairable candidate’s (even though one is complete) and not know which candidate to repair. If the “wrong candidate” (i.e., the one that is not complete) is repaired, the condition on chain is violated.
To achieve this property, it is necessary that a complete candidate and repairable candidate intersect at at least one benign storage-node. Thus, from the lower bounds of repairable and complete candidates,
4.4.4 Read set intersection The intersection between a complete and a candidate set of QC benign storage-nodes must result in at least a repairable being observed. Thus,
4.4.5 CW termination A CW operation is deﬁned to be complete once a total of QC benign storage-nodes have executed the CW.