Efficient, scalable consistency for highly fault-tolerant storage. Garth Goodson. CMU-CS-04-111, Dept. of Electrical and Computer Engineering, Carnegie Mellon University, August 2004.
203: for all (Oi ∈ RepairOperation) do
204:     ObjectSet := ObjectSet ∪ Oi
205: end for
206: ObjectHistorySets := QUERY(READ HISTORIES, ObjectSet)
207: if (BARRIER NEEDED(ObjectHistorySets) = TRUE) then
208:     ObjectHistorySets, Status := UPDATE(BARRIER OPERATION, ObjectSet, ObjectHistorySets)
209:     if (Status = FAIL) then
210:         return FAIL
211:     end if
212: end if
213: Candidate, Status := REPAIR NEEDED(ObjectHistorySets)
214: if (Status = FALSE) then
215:     return (FAIL)
216: end if
217: Status := UPDATE(RepairOperation, ObjectHistorySets)
218: return (Status)
Figure 5.3: Client multi-object repair pseudo-code.
constructs the object history sets and performs reclassification to ensure that repair is still required. A few cases exist in which repair is not required; the most obvious is when the operation being repaired has already completed (or has been repaired by another client). More subtle cases are discussed below. If the client determines that repair is still required, an update operation corresponding to the operation being repaired is performed (line 217).
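The control flow in Figure 5.3 can be sketched as a small client-side driver. This is an illustrative rendering, not the thesis implementation: the QUERY, UPDATE, BARRIER NEEDED, and REPAIR NEEDED primitives are passed in as stand-in callables, and all names are hypothetical.

```python
def repair(repair_op, objects, query, update, barrier_needed, repair_needed):
    """Attempt to complete repair_op across all objects it touches."""
    # Read the replica histories for every object in the operation (line 206).
    history_sets = query("READ_HISTORIES", objects)
    # A barrier write may be needed before repair can proceed (lines 207-212).
    if barrier_needed(history_sets):
        history_sets, status = update("BARRIER_OPERATION", objects, history_sets)
        if status == "FAIL":
            return "FAIL"
    # Reclassify: repair may have become unnecessary (lines 213-216).
    candidate, still_needed = repair_needed(history_sets)
    if not still_needed:
        return "FAIL"
    # Re-issue the original update to complete the repair (line 217).
    _, status = update(repair_op, objects, history_sets)
    return status
```

Because the primitives are injected, the same driver can be exercised against fakes, which mirrors how the client logic is independent of the wire protocol.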
As mentioned, repair may not be necessary for a few reasons. Usually, it will be the case that the histories returned from the barrier-write indicate that the RepairOperation has completed (or been repaired by another client). However, it is possible for some objects involved in a multi-object operation to be classified as repairable and others to be classified as incomplete (depending on the client's system view). If this is the case, it is not possible to repair all the candidates involved in the multi-object operation. The client can deduce, since some candidates involved in the multi-object operation are incomplete, that no candidate involved in the multi-object operation is complete (thus safely reclassifying the repairable candidates as incomplete). Likewise, by sending the set of object history sets for all objects updated by the multi-object operation to the metadata-nodes, the metadata-nodes can reach the same conclusion and allow such repairable candidates
Figure 5.4: Example of multi-object repair.
For this setup: N = 4, COMPLETE = 3, INCOMPLETE = 2. (a) Initially an update operation is performed on Objects A and B. However, it fails part-way through. (b) A read history operation is issued to read the object history set associated with Object A. Logical timestamp 1 is classified as incomplete. (c) An update is performed, and completes, on Object A at logical time 2 (conditioned on timestamp 0). (d) A query operation is performed on Object B. Logical timestamp 1 is classified as repairable, thus repair must be performed. First, the operation resulting in the version at timestamp 1 must be read. It returns the original operation that updated Objects A and B. The history of Object A is then read and classified. Object A's timestamp 2 is classified as complete. From this, the client can deduce that Object B's version at time 1 could never have completed (otherwise Object A's version at 2 would have been conditioned on time 1).
to be over-written.
Similarly, it may be the case that some objects (in a multi-object operation) appear as complete, while others appear incomplete or repairable. Again, the client and metadatanodes can come to the same conclusion by examining the object history sets of each object involved in the repair. Figure 5.4 shows an example of how multi-object repair works.
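The reclassification rule these paragraphs describe can be stated compactly: within one multi-object operation, an incomplete candidate at any object implies the operation never completed at any object, so repairable candidates can safely be demoted to incomplete (and over-written). The following is a minimal sketch of that deduction; the classification labels and function name are illustrative, not the thesis API.

```python
def reclassify(per_object_class):
    """per_object_class maps object ID -> 'complete' | 'repairable' |
    'incomplete', all for the same multi-object operation's candidate.
    Returns the safely reclassified mapping."""
    if "incomplete" in per_object_class.values():
        # No object's candidate can have completed, so every repairable
        # candidate is demoted to incomplete and may be over-written.
        return {oid: ("incomplete" if c == "repairable" else c)
                for oid, c in per_object_class.items()}
    # Otherwise the repairable candidates must be repaired forward.
    return dict(per_object_class)
```

Both the client and the metadata-nodes can evaluate this rule over the same set of object history sets and reach the same conclusion, which is what makes the demotion safe.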
This section has described a number of extensions to the R/CW protocol that enable query and update operations to be performed on metadata objects. Instead of clients transmitting entire objects, query and update operations can be used to perform operations that
Figure 5.5: Example metadata operations.
Metadata objects support query and update operations. Both types of metadata operations may need to perform repair. The requests a client sends to metadata-nodes to perform metadata operations are shown. (a) Query operations may complete in a single round trip. If insufﬁcient results are returned at the candidate’s timestamp, a second request is sent to collect results at the candidate’s timestamp. (b) For update operations, a client ﬁrst requests replica histories to identify the candidate. Then, the client issues the update operation with an object history set constructed from replica histories returned by a recent query. (c) To perform repair a client requires the operation that resulted in the candidate. Since operations may span multiple objects, metadata-nodes potentially return many replica histories.
are executed by each metadata-node. When metadata-nodes perform these operations, they are able to verify the integrity of the request and the result. Update operations require an object history set constructed from the replica histories returned by a recent query operation.
To increase efﬁciency, object history sets can be cached by clients. Both query and update operations may require repair.
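The client-side caching of object history sets mentioned above could look something like the following sketch. The class and method names are hypothetical; the point is only that an update can reuse the history set from the last query and fall back to a fresh query if the cached set proves stale.

```python
class OhsCache:
    """Illustrative per-client cache of object history sets."""

    def __init__(self, query):
        self._query = query   # callable fetching a fresh object history set
        self._cache = {}      # object ID -> cached object history set

    def history_set(self, oid):
        # Serve from cache when possible; otherwise query the metadata-nodes.
        if oid not in self._cache:
            self._cache[oid] = self._query(oid)
        return self._cache[oid]

    def invalidate(self, oid):
        # Called when an update is rejected because the cached set is stale.
        self._cache.pop(oid, None)
```

A cache hit lets an update proceed in a single round trip, at the cost of an occasional rejected update when the cached history set is out of date.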
Figure 5.5 describes query, update, and repair operations.
Figure 5.5(a) illustrates the requests and replies a client exchanges with a metadata-node to perform a query operation. Figure 5.5(b) illustrates the requests and responses a client exchanges with a metadata-node to perform an update operation. Figure 5.5(c) illustrates the requests and responses a client exchanges with a metadata-node to perform repair.
5.3 Improving efficiency

Two additional considerations for metadata objects are presented in this section. The first is an optimization to handle large metadata objects. The second details an approach to efficiently synchronize object replicas.
5.3.1 Breaking large objects into blocks

Metadata objects can be very large (e.g., a directory object with thousands of files). To handle large metadata objects efficiently, metadata object replicas can be broken into fixed-size blocks. Even though the metadata object replica is broken into blocks, metadata operations still occur atomically on the metadata object.
If metadata objects can be stored in a structured fashion, update operations can be implemented to respect block boundaries (e.g., by not allowing directory entries to span blocks). In so doing, the number of blocks modified by an update operation can be minimized. If the state of metadata objects cannot be stored in a structured fashion, then techniques like those used in the Low Bandwidth File System [Muthitacharoen et al. 2001] could be employed to minimize the number of "chunks" that a metadata update operation modifies.
The value veriﬁer of the logical timestamp for a CW operation on a large metadata object is a collision-resistant hash of the list of the replica’s block hashes. The cost of generating the veriﬁer is linear in the number of blocks that comprise the object. For extremely large objects, Merkle hash trees [Merkle 1987] should be considered.
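The verifier construction just described can be sketched concretely. This is an illustrative rendering under assumptions: SHA-256 stands in for whatever collision-resistant hash the system uses, and the 4 KB block size anticipates the directory-object block size used later in this chapter.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size; matches the 4 KB directory blocks

def block_hashes(data):
    """Split the replica into fixed-size blocks and hash each block."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    return [hashlib.sha256(b).digest() for b in blocks]

def value_verifier(data):
    """Collision-resistant hash over the list of block hashes: cost is
    linear in the number of blocks that comprise the object."""
    return hashlib.sha256(b"".join(block_hashes(data))).digest()
```

An update that touches one block only needs to rehash that block and then the (short) hash list, rather than the whole object; a Merkle tree would further reduce the recomputation for extremely large objects.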
5.3.2 Object synchronization
Since update operations execute at only a subset of metadata-nodes, it is possible for some metadata-nodes to become "out-of-sync" (i.e., to not host the most recent complete candidate). To perform an update operation, a metadata-node requires the version of the object at the candidate's timestamp. A metadata-node can "sync" its object replica by fetching the value corresponding to the latest complete candidate directly from another metadata-node. There is enough information in the object history set for the metadata-node to know which other metadata-nodes host the candidate. As well, the metadata-node can validate the correctness of the object value it receives with the verifier in the candidate's timestamp.
To sync a large metadata object, a metadata-node requests the hash list (or tree) for the candidate object from another metadata-node that hosts the candidate. The value veriﬁer in the timestamp validates the correctness of the hash list returned. Given the hash list for the candidate object version, the metadata-node can request only the out-of-date blocks.
The hashes in the hash list validate each block of the metadata object.
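The block-level sync just described amounts to a hash-list diff: compare the local replica's block hashes against the candidate's (verifier-validated) hash list and fetch only the blocks that differ. The sketch below assumes a hypothetical `fetch_block` RPC; names are illustrative.

```python
def out_of_date_blocks(local_hashes, candidate_hashes):
    """Return the indices of blocks the local replica must fetch."""
    stale = []
    for i, want in enumerate(candidate_hashes):
        have = local_hashes[i] if i < len(local_hashes) else None
        if have != want:
            stale.append(i)
    return stale

def sync_replica(local_blocks, local_hashes, candidate_hashes, fetch_block):
    """Bring the local replica up to the candidate version, fetching only
    out-of-date blocks; each fetched block would be validated against its
    hash in candidate_hashes before being accepted."""
    blocks = list(local_blocks)[:len(candidate_hashes)]
    blocks += [None] * (len(candidate_hashes) - len(blocks))
    for i in out_of_date_blocks(local_hashes, candidate_hashes):
        blocks[i] = fetch_block(i)
    return blocks
```

For a mostly up-to-date replica, the cost of syncing is proportional to the number of changed blocks rather than the object size.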
5.4 PASIS metadata objects
The PASIS metadata service (PMD service) exports a number of metadata objects. Each type of metadata object consists of internal state and provides a set of deterministic operations that can be performed on the object, as described in Section 5.2. Some operations span multiple objects; for example, a rename operation is performed on a pair of directory objects. Others may be read-only. This section describes the design of four types of metadata objects: directory objects, attribute objects, lock/lease objects, and authorization objects. Directory and attribute objects are fully implemented; lock and authorization objects are designed but not yet implemented. Implementation details of directory and attribute objects are described within this section. The implementation of the PMD service, in the context of a distributed NFS framework, is described in Section 5.5.
The design of the PASIS metadata objects focuses on minimizing the access concurrency experienced by any one metadata object. Reducing the amount of update concurrency experienced by metadata objects improves the efﬁciency of the underlying R/CW protocol actions.
5.4.1 Attribute objects
An attribute object exists for each ﬁle stored in the PASIS storage service. The attribute object contains the per-ﬁle information expected by the clients (e.g., the NFS server and the NFS clients). In our implementation, these attributes map directly to typical UNIX ﬁle attributes (e.g., mode, link count, uid, gid, size, mtime, ctime, etc.).
There is a tradeoff between storing attributes in separate objects versus storing them within their parent directory entry. If stored within the directory, operations that access attributes and directories need only access a single metadata object. However, storing attributes within directory entries increases the false sharing of the directory object for any operation that operates on the attributes without operating on the directory (e.g., setattr and getattr). Since there may be many ﬁles managed by each directory object, the read and update trafﬁc for these attributes could generate frequent concurrent accesses to the directory object. Additionally, hard links (i.e., multiple names for the same object) cannot be easily supported if attributes are stored within directory entries.
5.4.2 Directory objects
Directory objects store the names and access information for ﬁles and other directories.
Access information speciﬁes how the named object can be accessed (e.g., where the objects are located and their encodings, not access control information). The access information for directories is PMD service speciﬁc. The access information for ﬁles is storage service speciﬁc. For example, if the R/W protocol is being used as the protocol underlying the storage-service, the access information will contain the protocol parameters (e.g., N, m, QC, etc.) and the encoding scheme being used (e.g., replication, IDA, etc.).
The attributes of a directory are stored in the directory object itself; a separate attribute object is not used. Since most operations that access a directory object also access the directory's attributes, this design decision does not contravene the design goal of separating objects to minimize access concurrency. Indeed, directories maintaining their own attribute information allows for greater efficiency at the storage-node and over the network: object histories need only be maintained (and returned) for the directory objects.
On the other hand, since ﬁles can use a separate storage-service protocol, attributes and data must be updated independently.
The directory object, since it may be large, is stored as a collection of blocks (see Section 5.3.1). In our implementation, the directory object block size is 4 KB. A simple structure is used in the implementation of the directory objects: it is just a list of directory entries. Each directory entry is a pair consisting of a name and its access information. The access information encodes the object's ID, the set of node IDs that host the named object, and the scheme that describes the encoding of the object (e.g., the replication factor). The object encoding is specific to the service owning the named object (i.e., either the PMD service or the PASIS storage-service).
To look up a name in the directory object, a linear search of its directory entries is performed. When entries are added to the directory object, care is taken to avoid splitting them across block boundaries. When entries are deleted, no compaction is performed, but the freed space may be used for future entry insertions. Standard improvements to directory implementations, such as using B-trees to avoid linear searches, could be applied.
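The layout and operations described above can be sketched as follows. The entry-size accounting is deliberately crude (a stand-in for on-disk encoding), and all names are illustrative; the sketch only shows the three properties stated in the text: linear-search lookup, inserts that never split an entry across a block boundary, and deletes that leave reusable free space without compaction.

```python
BLOCK_SIZE = 4096  # the 4 KB directory-object block size from the text

class DirectoryObject:
    def __init__(self):
        self.blocks = [[]]                    # each block is a list of entries

    @staticmethod
    def _entry_size(name, access):
        return len(name) + len(access)        # crude stand-in for encoded size

    def _block_used(self, block):
        return sum(self._entry_size(n, a) for n, a in block)

    def lookup(self, name):
        # Linear search over all entries in all blocks.
        for block in self.blocks:
            for n, access in block:
                if n == name:
                    return access
        return None

    def insert(self, name, access):
        size = self._entry_size(name, access)
        # Place the whole entry in the first block with room; never split it.
        for block in self.blocks:
            if self._block_used(block) + size <= BLOCK_SIZE:
                block.append((name, access))
                return
        self.blocks.append([(name, access)])  # start a new block

    def remove(self, name):
        # No compaction: the freed space is reused by later inserts.
        for block in self.blocks:
            block[:] = [(n, a) for n, a in block if n != name]
```

Keeping each entry within a single block means an update operation dirties at most the blocks whose entries it touches, which is what makes the hash-list verifier of Section 5.3.1 cheap to maintain.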
5.4.3 Lock (lease) objects
Lock objects provide serialization points. Locks are not needed for metadata object consistency, since the Q/U protocol ensures that all metadata operations occur atomically.
However, lock objects may be desired by clients wishing to control access to data. As such, the design and implementation of lock objects is dependent on their use. By providing the ability to implement lock objects using the Q/U protocol, locks are guaranteed to have the same fault-tolerance, consistency, and scalability properties as the metadata.
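One plausible shape for such a lock (lease) object is a small state machine whose operations are deterministic, so every metadata-node executing them computes the same result. The sketch below is a design illustration, not the thesis implementation: the current time is passed in explicitly (rather than read from a local clock) precisely to keep the operations deterministic, and all names are hypothetical.

```python
class LeaseObject:
    """Illustrative lease object with deterministic acquire/release."""

    def __init__(self):
        self.holder = None
        self.expires = 0

    def acquire(self, client_id, now, duration):
        # Grant the lease if it is free, expired, or already held by the
        # requester (renewal). 'now' is an input, so replicas agree.
        if self.holder is None or now >= self.expires or self.holder == client_id:
            self.holder, self.expires = client_id, now + duration
            return True
        return False

    def release(self, client_id):
        # Only the current holder may release the lease.
        if self.holder == client_id:
            self.holder = None
            return True
        return False
```

Using leases rather than indefinite locks bounds the damage a failed client can do: the serialization point recovers on its own once the lease expires.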
If lock objects are implemented in this manner, the storage service must be able to validate locks presented to it. Recall that the storage service is implemented by a distinct set of storage-nodes. We envision three possible scenarios for lock validation. First, capabilities are generated as the result of lock operations; these capabilities can be verified by storage-