For all of these systems, scalability and fault-tolerance of the metadata service are key challenges. The most common fault-tolerance solutions are agreement algorithms that perform state machine replication (e.g., using a protocol like [Bracha and Toueg 1985]). Unfortunately, such an approach does not scale as replicas are added. To make this approach scale, it is common to partition metadata across separate metadata servers (or replica sets). Unfortunately, unlike with data, this solution often comes with a visible change in semantics: loss of ability to perform atomic operations, such as rename, across directories stored on distinct metadata servers.
This chapter develops an alternate design for survivable, scalable metadata services that maintains strong semantics. The architecture of this system is shown in Figure 5.1.
Our PASIS metadata service (PMD) is “survivable” in that it relies on few assumptions about the environment in which it runs: it is designed to withstand arbitrary (Byzantine [Lamport et al. 1982]) failures of clients and a limited number of metadata-nodes, and requires no timing (synchrony) assumptions for correctness. In addition, it is “scalable” in that the addition of new metadata-nodes yields improvements in the capacity, throughput, or fault-tolerance of the service; we refer to this as “horizontal scalability”.

Figure 5.1: ... systems has been scaling read–write storage to improve throughput, capacity, or fault-tolerance. The PASIS metadata service scales in a similar fashion: its throughput, capacity, or fault-tolerance can be improved by adding more metadata-nodes. Note that the metadata and storage processes can execute on the same hardware, even though the picture and most designs have them logically separated.
The PASIS metadata service is constructed of metadata objects that utilize the R/CW protocol described in Chapter 4. While the R/CW protocol focused on reading and writing entire objects, this chapter extends the R/CW protocol to allow for more general query and update operations. These operations provide access to objects at a finer granularity (e.g., reading/inserting directory entries vs. reading/writing full directories) (Sections 5.2.1 and 5.2.2). In addition, atomic updates across multiple objects are required: e.g., moving a file from one directory to another requires that the removal and insertion be performed atomically on the source and destination directories. Thus, a single conditional write operation that can modify multiple objects and can be conditioned on a superset of these objects being unchanged is introduced in Section 5.2.3. However, these extensions do not fundamentally alter how the R/CW protocol behaves. The bounds in terms of N and the thresholds remain the same, as does the optimistic nature of the protocol and its guarantees, all while avoiding expensive cryptography.
Moreover, by avoiding heavyweight agreement protocols, the PASIS metadata service offers horizontal scalability that has not yet been achieved for such a service. Our protocols derive from threshold voting protocols [Gifford 1979; Thomas 1979]. In such an approach, only subsets (i.e., a majority) of metadata-nodes need be accessed to complete an operation. As such, metadata-nodes can be added to improve capacity, throughput, or fault-tolerance. Moreover, the threshold voting approach employed can be extended to quorum systems that offer greater throughput scalability [Malkhi et al. 2000; Naor and Wool 1998].
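To make the threshold-voting intuition concrete, consider the following sketch (illustrative Python; the constants below are generic majority-quorum arithmetic, not the exact thresholds of the R/CW protocol, which are given in Chapter 4). Because each operation contacts only a quorum of metadata-nodes, adding nodes increases the service's aggregate capacity rather than adding work to every node.

    # Generic majority-quorum arithmetic, used here only to illustrate why
    # threshold voting scales: each operation touches a subset of nodes,
    # and any two such subsets intersect.  (Not the R/CW protocol's bounds.)

    def majority_quorum_size(n: int) -> int:
        """Smallest subset size such that any two subsets must intersect."""
        return n // 2 + 1

    def quorums_intersect(n: int) -> bool:
        """Sanity check: two majority quorums of n nodes always overlap."""
        return 2 * majority_quorum_size(n) > n

    for n in (4, 5, 7, 10):
        print(n, majority_quorum_size(n), quorums_intersect(n))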
5.1 Overview
Metadata objects are a type of R/CW object that provides metadata-specific interfaces.
Read operations of R/CW objects are extended to be query operations of metadata objects; read operations read the R/CW object, whereas query operations return the result of a deterministic read-only function performed on the metadata object. CW operations of R/CW objects are extended to be update operations of metadata objects; CW operations send an object replica in each request, whereas update operations invoke a deterministic function on the metadata object.
Since each metadata-node performs the operation on its replica, metadata objects provide replicated state machine [Schneider 1990] semantics. These semantics prevent Byzantine clients from corrupting the state of metadata objects, since all updates are verified by the metadata-nodes. For example, metadata-nodes can prevent a Byzantine client from inserting an existing name into a directory object, because metadata objects can only be manipulated by the appropriate operations (and the results can be verified by the metadata-nodes).
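As a concrete illustration of such a metadata object, the sketch below (hypothetical Python interfaces, not the system's implementation) shows a directory object whose query operations are read-only and whose update operations are validated before being applied, so a metadata-node can reject an insert of a name that already exists.

    # Hypothetical sketch of a directory metadata object exposing query and
    # update operations; each metadata-node validates updates against its replica.

    class DirectoryObject:
        def __init__(self):
            self.entries = {}          # name -> file identifier

        # --- query operations: deterministic, read-only ---
        def lookup(self, name):
            return self.entries.get(name)

        def readdir(self):
            return sorted(self.entries)

        # --- update operations: validated before being applied ---
        def insert(self, name, file_id):
            if name in self.entries:   # the node rejects a duplicate insert
                raise ValueError("name already exists")
            self.entries[name] = file_id

        def remove(self, name):
            if name not in self.entries:
                raise ValueError("no such name")
            del self.entries[name]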
Two optimizations have been implemented to improve the efficiency of metadata objects. First, operations can be performed on metadata objects optimistically by sending only the operation and object history set to the metadata-nodes; entire objects need not be transmitted across the network, thus reducing bandwidth. This is optimistic because, if the metadata-node does not host the candidate version on which the operation is to be performed, the metadata-node must re-sync its replica of the object (requiring an additional round-trip). Second, large metadata objects are broken into blocks, which reduces bandwidth when synchronizing large replicas: when replica values must be fetched, only the modified portions of the replica need be sent.
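A rough sketch of the block-level synchronization idea follows (the block size and use of SHA-256 are assumptions for illustration): the stale node compares per-block hashes against the up-to-date replica and fetches only the blocks that differ.

    import hashlib

    BLOCK_SIZE = 4096   # assumed block size; the real choice is a tuning decision

    def block_hashes(value: bytes):
        """Hash each fixed-size block of a replica's value."""
        return [hashlib.sha256(value[i:i + BLOCK_SIZE]).digest()
                for i in range(0, len(value), BLOCK_SIZE)]

    def blocks_to_fetch(stale_value: bytes, current_hashes):
        """Indices of blocks a stale replica must fetch from an up-to-date one."""
        stale = block_hashes(stale_value)
        return [i for i in range(len(current_hashes))
                if i >= len(stale) or stale[i] != current_hashes[i]]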
5.2 Metadata operations
To perform operations correctly, metadata-nodes must perform the operation on the version of the object replica that corresponds to the latest complete candidate. As in the R/CW protocol, the metadata-node requires the object history set to classify the complete candidate. As such, metadata operations build closely upon the R/CW protocol developed in the previous chapter. However, instead of shipping the data values (the results of client-side operations), the operations themselves are transmitted.
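The following sketch (simplified Python; the classification step and message format are abstracted away) illustrates the node-side consequence of shipping operations rather than values: a metadata-node can execute the operation only if its local replica already hosts the candidate version, and otherwise must report that it needs to synchronize first.

    # Hypothetical server-side handling of a shipped operation (illustration only;
    # classification of the candidate follows the R/CW protocol and is omitted).

    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class Replica:
        version: int        # timestamp of the locally hosted candidate (simplified)
        value: Any

    def apply_operation(replica: Replica, candidate_version: int,
                        operation: Callable[[Any], Any]) -> dict:
        """Apply a shipped operation if the replica hosts the candidate version."""
        if replica.version != candidate_version:
            # Replica is stale: it must sync the candidate's state first
            # (an extra round-trip, as described above).
            return {"status": "NEED_SYNC", "have": replica.version}
        result = operation(replica.value)      # deterministic query/update function
        return {"status": "OK", "result": result}

    # Example: a query that looks up a name in a directory replica.
    replica = Replica(version=7, value={"a.txt": 12, "b.txt": 34})
    print(apply_operation(replica, 7, lambda d: d.get("a.txt")))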
Metadata operations can be performed atomically on multiple objects. Since some operations span metadata objects, failure atomicity requires that such operations be applied to all of the affected objects atomically. For example, rename removes a file from one directory object and adds it to another directory object.
This section describes two classes of metadata operations: query operations and update operations. Multi-object operations are also discussed.
5.2.1 Query operations
QUERY(Operation) :
100: ObjectHistorySet, QueryResultSet := DO QUERY(Operation, QUERY LATEST, ⊥)
101: Candidate, Status := CLASSIFY(ObjectHistorySet)
102: /∗ Iterate through returned object history set ∗/
103: CandidateResultSet := ∅
104: for all (Mi ∈ MetadataNodeSet) do
105:   /∗ Add results from responses whose latest element matches the candidate ∗/
106:   if (MAX[ObjectHistorySet[Mi]] = Candidate) then
107:     CandidateResultSet := CandidateResultSet ∪ QueryResultSet[Mi]
108:   end if
109: end for
110: /∗ Perform voting on the set of matching data results; need b+1 matching responses ∗/
111: Count, Data := VOTE(CandidateResultSet)
112: if (Count < b + 1) then
113:   /∗ If fewer than b+1 results match, redo the query at the candidate’s timestamp ∗/
114:   CandidateResultSet := DO QUERY(Operation, QUERY LTIME, Candidate.LT)
115:   Count, Data := VOTE(CandidateResultSet)
116: end if
117: /∗ If classification yields a complete candidate, return any of the matching votes ∗/
118: if (Status = CLASSIFIED COMPLETE) then
119:   return (SUCCESS, Candidate.LT, Data)
120: else
121:   /∗ Status = CLASSIFIED REPAIRABLE, perform repair ∗/
122:   return (REPAIR OPERATION(Operation, Candidate, Candidate.LT_conditioned, ObjectHistorySet))
123: end if
Figure 5.2: Client query pseudo-code.
Because the verifier in the timestamp is computed over the entire object, it cannot be used to validate the data associated with a read response; instead, a voting scheme must be used.
The pseudo-code for a query operation is shown in Figure 5.2. To perform a query operation, the client issues it to the metadata-nodes; each metadata-node returns its replica history as well as the result of the query operation applied to the latest version of its object replica (cf. line 100). The client identifies the candidate by performing classification on the object history set (cf. line 101).
Once the candidate is classified as complete, the client must determine the result of the query operation. Since only results pertaining to the latest timestamp in a replica’s history are returned, the set of results corresponding to the candidate’s timestamp must be constructed (cf. line 107).
The client then counts the votes in this set of results (cf. line 111). Matching results from b + 1 metadata-nodes are sufficient “votes” for a client to use the result. Of course, more than b + 1 object histories are required to identify the latest complete candidate.
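A minimal sketch of this voting step follows (the result encoding is an assumption; results are represented as hashable values that can be compared for equality).

    from collections import Counter

    def vote(candidate_results, b):
        """Return (count, result) for the most common result among responses that
        matched the candidate; the caller accepts it only if count >= b + 1."""
        if not candidate_results:
            return 0, None
        result, count = Counter(candidate_results).most_common(1)[0]
        return count, result

    # Example with b = 1: three matching results out of four responses suffice.
    count, result = vote(["r1", "r1", "r2", "r1"], b=1)
    assert count >= 1 + 1 and result == "r1"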
Since query operation results are returned optimistically based on the latest version hosted by the metadata-node, it is possible that no response attains a sufficient number of “votes”.
If this is the case, the client performs the query metadata operation at a specific timestamp (the candidate’s timestamp); see line 114. Finally, if the candidate is classified as complete, the result of the “voting” can be returned. Otherwise, repair is needed. Since the client does not hold a full copy of the object, repair is more complicated than described in the R/CW protocol and will be discussed later in the context of multi-object operations.
As an optimization, if the query operation results are large, some metadata-nodes can act as witnesses by returning a hash of the operation’s result. Voting can then be performed over the resultant set of hashes. Since the results of most query operations are small, the tradeoff between the computation time required to perform the hashing and the cost of transmitting and comparing the data value favors the latter. (An exception may be the readdir operation, as it returns a large number of directory entries.)
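A hypothetical version of this witness scheme is sketched below (the digest choice and result encoding are assumptions): witnesses return only a digest of their result, the client votes over digests, and the full listing is taken from any non-witness node whose digest belongs to a group of at least b + 1 matching responses.

    import hashlib, json
    from collections import Counter

    def result_digest(result) -> bytes:
        """Canonical digest of a query result (JSON encoding is an assumption)."""
        return hashlib.sha256(json.dumps(result, sort_keys=True).encode()).digest()

    def vote_on_digests(full_results, witness_digests, b):
        """full_results: node -> full result; witness_digests: node -> digest."""
        digests = {node: result_digest(r) for node, r in full_results.items()}
        digests.update(witness_digests)
        winner, count = Counter(digests.values()).most_common(1)[0]
        if count < b + 1:
            return None
        # Return the full result from any non-witness node in the winning group.
        for node, r in full_results.items():
            if digests[node] == winner:
                return r
        return None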
5.2.2 Update operations
The CW operation of the R/CW protocol is extended for metadata objects to include update operations (e.g., setattr would update the attributes of an object). As in query operations, update operations do not transmit the object’s new data value; only the operation to be applied to the replica and the object history set are sent. Allowing the metadata-nodes to perform update operations locally ensures the validity of the update. As in the R/CW protocol, updates are conditioned on the latest complete candidate, which is determined through classification of the object history set. As with query operations, clients count “votes” on the results returned from the update operations.
If a metadata-node does not host the candidate, then it cannot safely perform the operation. In this case, the metadata-node must synchronize its object replica by fetching the state associated with the latest candidate. Object replica synchronization is discussed in Section 5.3.2.
Since the client does not have a local copy of the metadata object to update, it cannot construct the timestamp of the update operation (specifically, the verifiers in the timestamp). However, the timestamp can be constructed deterministically from the object history set and the resulting value of the updated metadata object. Since metadata-nodes have all of this information, they can be relied upon to deterministically construct the timestamp of the update operation (and it can be returned as part of the result). The calculated timestamp is then inserted into the replica’s history.
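The sketch below illustrates one way such a deterministic construction could look (the fields are a simplification; the actual timestamp format is defined with the R/CW protocol in Chapter 4): every correct metadata-node derives the same logical time from the histories and the same verifier from the resulting object value.

    import hashlib

    def make_timestamp(object_history_set, new_value: bytes):
        """Simplified deterministic timestamp construction: every node derives the
        same (logical time, verifier) pair from the histories and the new value."""
        latest = max(ts for history in object_history_set.values() for ts in history)
        verifier = hashlib.sha256(new_value).hexdigest()
        return (latest + 1, verifier)

    # Example: two nodes' replica histories (logical times simplified to integers).
    histories = {"node1": [1, 2, 3], "node2": [1, 2]}
    print(make_timestamp(histories, b"new directory contents"))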
5.2.3 Multi-object operations

Since all metadata object replicas are stored on the same set of metadata-nodes, the metadata-nodes can locally lock the set of object replicas being operated upon. Thus, a metadata-node can perform validation for each object replica accessed by the operation and then, only if validation passes for all objects, execute the operation. This approach to validation has similarities to the validation phase performed in optimistic concurrency control [Kung and Robinson 1981]. However, one extra step is required in the validation of multi-object update operations. To prevent malicious clients from executing different operations across different objects at different metadata-nodes, the hash of the operation (including its arguments and the set of objects on which it operates) is included in the logical timestamp. This fixes the result of a multi-object update operation to a specific timestamp.
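A rough node-side sketch of this multi-object path follows (the lock ordering, validation predicate, and encoding of the operation hash are assumptions for illustration; the operation is named by a string so it can be serialized): lock every named object, validate each against the version the client conditioned on, execute only if all validations pass, and fold a digest of the operation, its arguments, and the object set into the timestamp.

    import hashlib, json
    from threading import Lock

    class MetadataNode:
        """Hypothetical per-node state; replicas maps object id -> (version, value)."""

        def __init__(self, replicas):
            self.replicas = replicas
            self.locks = {oid: Lock() for oid in replicas}

        def multi_object_update(self, op_name, args, conditioned_on):
            """conditioned_on maps object id -> version the client conditioned on."""
            object_ids = sorted(conditioned_on)          # deterministic lock order
            # Digest over the operation, its arguments, and the object set; this is
            # what gets folded into the logical timestamp of the update.
            op_hash = hashlib.sha256(
                json.dumps([op_name, args, object_ids], sort_keys=True).encode()
            ).hexdigest()
            for oid in object_ids:
                self.locks[oid].acquire()
            try:
                # Validate every object replica; execute only if all checks pass.
                if any(self.replicas[oid][0] != conditioned_on[oid]
                       for oid in object_ids):
                    return {"status": "CONDITION_FAILED"}
                # ... apply the deterministic operation to all locked objects ...
                return {"status": "OK", "op_hash": op_hash}
            finally:
                for oid in object_ids:
                    self.locks[oid].release()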
Multi-object operations complicate repair. Pseudo-code for the repair of multi-object operations is shown in Figure 5.3. If a repairable candidate is identified, then the client must request the operation that resulted in the repairable candidate (cf. line 200). Note that, since barrier-writes are always followed by an update operation, they need not be repaired. In response to the READ OPERATION query, a metadata-node returns the operation and the object replica histories for all objects updated by the operation at the specified logical timestamp. Recall that the QUERY operation requires b + 1 votes to return a result. The client constructs an object history set for each object updated by the operation by issuing a READ HISTORIES query operation containing the object history sets of interest (cf. line 206).
Next, a check is made to verify whether a barrier-write across the sets of objects is required (cf. line 207). If a barrier is required, then the client performs a barrier-write conditioned on these object history sets (cf. line 208). If the barrier-write completes, the client re-
REPAIR OPERATION(Object, LT) :
200: RepairOperation := QUERY(READ OPERATION, Object, LT)
201: /∗ Iterate through all objects in the returned operation ∗/
202: ObjectSet := ∅