Efficient, scalable consistency for highly fault-tolerant storage. Garth Goodson. August 2004. CMU-CS-04-111. Dept. of Electrical and Computer ...
nodes (as in NASD [Gibson et al. 1998]). Second, all accesses to the storage service are serialized through the metadata service. Third, each storage-node verifies each access before performing it; i.e., each storage-node issues a query operation to the PMD service that returns the lock status.
There are two basic uses of lock/lease objects in distributed file systems: to maintain client cache consistency within the storage service and to provide application locking of data (i.e., file locking). To maintain client cache consistency, clients must be notified of changes to cached data. In such an approach, callbacks from the metadata service would be needed to notify holders of cached data that the data is stale. To maintain the fault-tolerance of the system, the application server ought to wait for b + 1 callbacks before acting; however, since caching is done for performance, not correctness, it is safe to invalidate cache entries based on a single callback.
Since fault-tolerant systems should not rely on potentially faulty clients to release locks, lock objects should provide lease semantics. Achieving lease semantics requires that locks time out. The R/CW protocol is developed in an asynchronous model of time, so that invalid timing assumptions cannot break the properties provided by the R/CW protocol. In practice, loosely synchronized clocks are common and, if used wisely, can be used to expire acquired locks.
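A minimal sketch of lease semantics under loosely synchronized clocks is shown below. The class name, duration, and skew bound are hypothetical, not taken from the PASIS implementation; the point is only that the holder stops trusting the lease slightly early, so clock skew cannot cause two parties to believe they hold it simultaneously.

```python
import time

class Lease:
    """A lock with lease semantics: it expires after a fixed duration,
    so a crashed or faulty client cannot hold it forever.
    (Illustrative sketch; names and durations are hypothetical.)"""

    def __init__(self, holder, duration_s):
        self.holder = holder
        self.expires_at = time.monotonic() + duration_s

    def is_valid(self, skew_s=0.5):
        # Treat the lease as expired skew_s early, so with loosely
        # synchronized clocks the holder stops using it before any
        # other party could consider it free to grant again.
        return time.monotonic() < self.expires_at - skew_s

live = Lease(holder="client-a", duration_s=30.0)
dead = Lease(holder="client-b", duration_s=0.0)
```

The asymmetry is deliberate: the holder gives up early, while a grantor would wait out the full duration plus skew before reissuing.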
5.4.4 Authorization objects
Authorization objects manage the privileges associated with metadata objects. There are two standard approaches to managing privileges: access control lists (ACLs) and capabilities. ACLs manage privileges on a per-object basis whereas capabilities manage privileges on a per-client/user basis. Either approach to privilege management can be implemented with authorization objects. An authorization object can be associated with each metadata object, and operations on the metadata object will only be performed if authorized.
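The check described above can be sketched as follows. The ACL layout and function names are assumptions for illustration; the thesis does not fix a concrete structure.

```python
class AuthorizationObject:
    """Hypothetical per-object ACL: maps principal -> set of
    permitted operations on the associated metadata object."""
    def __init__(self, acl):
        self.acl = acl

    def permits(self, principal, op):
        return op in self.acl.get(principal, set())

def perform_metadata_op(authz, principal, op, apply_fn):
    # A metadata-node first validates the request against the
    # object's authorization object, then executes it.
    if not authz.permits(principal, op):
        raise PermissionError(f"{principal} may not {op}")
    return apply_fn()

authz = AuthorizationObject({"alice": {"read", "write"}, "bob": {"read"}})
result = perform_metadata_op(authz, "alice", "write", lambda: "ok")
```

A capability-based variant would instead have the client present a token that the node verifies, but the gate before `apply_fn` is the same.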
Authorization objects may be needed for the storage service as well. Validation of authorization objects can occur similarly to the validation of locks. For example, the storage service can perform a read of the authorization object before permitting data to be read or written. Or, the application server can provide a capability to the storage service to read or write specific data.

Figure 5.6: PASIS storage system. Components of the PASIS storage system are shown above. The PASIS storage system is split into two components: a client and a set of storage-nodes. The client implements an NFSv3 server. The NFS server consists of a PASIS metadata (PMD) component and a PASIS storage (PS) component. A single NFS server is able to support multiple concurrent NFS clients. Alternatively, the NFS server may be mounted via loop-back on the same machine as the NFS client.
5.5 Storage-system implementation
5.5.1 Metadata operations

Table 5.1 lists the set of metadata operations that are currently implemented by the PMD service. The operations are inspired by NFS, but are generic enough to support many file system instances. The Type field specifies whether the operation is an update or query operation. Example query operations include: getattr, lookup, readdir, and readlink.
The Object field specifies the number and types of metadata objects on which that operation operates. In the case of operations that span multiple objects, more than one metadata object is listed. For example, remove modifies the parent directory object and the link count attribute stored within the file's attribute object.
As can be seen, many operations act on directory objects. Many of these operations modify directory attributes as well as directory entries (e.g., create, remove), thus justifying our design decision to encapsulate attributes within the directory object.
5.5.2 PMD metadata-nodes
The metadata-nodes use the Comprehensive Versioning File System (CVFS) [Soules et al. 2003] to store data objects and their versions. The query/update extensions to the R/CW protocol, as described in Section 5.3, have been implemented, as have object synchronization and multi-object repair. Additionally, each metadata operation described in Table 5.1 has been fully implemented.
CVFS objects

On the metadata-node, each metadata object is associated with three CVFS objects. One CVFS object stores the metadata object's internal state (e.g., the directory structure). Attributes are stored within the extended attribute field of this CVFS object. Another CVFS object stores the metadata object's history, while the third stores a hash tree computed over the object's internal state (to support large objects; see Section 5.3.1). The metadata object's history and internal state are versioned on every update.
These versions can be garbage collected once the metadata-node classifies a later update operation as complete (i.e., on the next successful update of the metadata object). Note that completed barrier operations do not trigger this version-history compaction.
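The compaction rule above can be stated in a few lines. This is an illustrative model (a timestamp-to-state map, not the CVFS on-disk layout): versions older than the latest complete update are discarded, unless the completing operation is a barrier.

```python
def compact_versions(versions, latest_complete_ts, is_barrier):
    """Discard versions strictly older than the latest update
    classified as complete. Completed barrier operations do not
    result in compaction. (Sketch; 'versions' is a ts -> state
    dict, not the CVFS representation.)"""
    if is_barrier:
        return dict(versions)
    return {ts: v for ts, v in versions.items() if ts >= latest_complete_ts}

history = {1: "v1", 2: "v2", 3: "v3"}
compacted = compact_versions(history, 3, is_barrier=False)
```

Here `compacted` retains only the version at timestamp 3, while a barrier at the same timestamp would leave all three versions in place.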
Along with the metadata object's history, query operations optimistically return the result of the operation performed on the latest version of the metadata object's internal state (as described in Section 5.2.1). A special query operation, readhist, is used to read only an object's history. Batching of readhist results is supported (i.e., histories from multiple objects can be returned by a single call). In addition, all update operations return the history associated with each object involved in the operation. These histories can be cached by clients to reduce the number of read history queries. Each metadata-node generates N authenticators over the object histories using HMACs based on pair-wise symmetric keys.
We use a publicly available implementation of MD5 for all hashes [Rivest 1992]. Each HMAC is 16 bytes long.
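Authenticator generation as described can be sketched with Python's standard `hmac` module (the key derivation and history serialization below are illustrative assumptions; only the MD5-based HMAC with 16-byte digests is from the text):

```python
import hashlib
import hmac

def make_authenticators(history_bytes, pairwise_keys):
    """Generate one authenticator per peer metadata-node over a
    serialized replica history: an HMAC keyed with the pair-wise
    symmetric key shared with that peer. MD5 is used as in the
    thesis, giving 16-byte digests."""
    return [hmac.new(key, history_bytes, hashlib.md5).digest()
            for key in pairwise_keys]

keys = [b"key-to-node-%d" % i for i in range(4)]   # N = 4 peers (hypothetical keys)
auths = make_authenticators(b"serialized replica history", keys)
```

Each peer can recompute its own HMAC from the shared key to verify that the history originated at this node and was not altered.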
operation ordering at that storage-node. This can help prevent unnecessary object syncing when operations are executed out of order, as the following example illustrates.
Imagine the following sequence of operations pending at a single metadata-node at the same time: (1) create (a, /), (2) create (b, /), (3) setattr (b). Note that, if a correct client performed the operations, operations (2) and (3) can only be pending concurrently if operation (2) has completed successfully and operation (3) is conditioned on (2); this can occur on a slow storage-node, since only a subset of the updates need to complete for the operation to complete, but updates are transmitted everywhere. If only object locking is performed, without preserving operation ordering: operation (1) locks the '/' directory; operation (2) blocks on the lock on the '/' directory held by operation (1); operation (3) attempts the setattr although the create has not yet completed on this metadata-node; in this case, object syncing would attempt the create.
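The effect of preserving arrival order can be modeled in a few lines. This sketch is illustrative (the function and its bookkeeping are not the PMD code): when setattr (b) executes before create (b), the node must fall back to object syncing; with order preserved, no sync occurs.

```python
def execute(pending):
    """Execute pending operations in the given order, recording
    which objects needed syncing because an operation arrived
    for an object that did not yet exist locally."""
    state = {}
    synced = []
    for op, name in pending:
        if op == "create":
            state[name] = {}
        elif op == "setattr":
            if name not in state:
                synced.append(name)   # missing object: sync required
                state[name] = {}      # syncing re-creates the object
            state[name]["attr"] = "set"
    return state, synced

ordered = [("create", "a"), ("create", "b"), ("setattr", "b")]
state, synced = execute(ordered)                 # order preserved
_, synced_bad = execute([("setattr", "b")])      # setattr ran first
```

With order preserved, `synced` is empty; executing the setattr first forces a sync for object b.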
Update operation validation
After each replica within the operation has been locked, each object history set is validated. Once validation has completed successfully for all objects, the update operation is performed. Validation is the same as for the R/CW protocol, with two exceptions. First, since the conditioned-on timestamp is calculated from the object history set (passed in by the client), no validation is performed on the conditioned-on timestamp (lines 735 and 741 of Figure 4.8). Second, since update operations are transmitted, as opposed to full data objects in the R/CW protocol, there is no Verifier Data to validate. However, if repair is being performed, metadata-nodes must validate that the correct operation is being performed. To do this, the operation hash is compared to the repairable candidate's operation hash; recall that the operation hash is stored within the timestamp.
If the operation completes successfully, a hash is generated over the replica's updated contents and is added to the object's hash tree. Each replica history is updated with the new timestamp computed from a hash of the object's hash tree, the operation's hash, and the hash of the object history set (which was used to validate the operation, as described in Section 5.2.2).
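The timestamp construction can be sketched as below. The tuple layout is an assumption for illustration; the thesis says only that the new timestamp covers these three hashes and that the operation hash is carried inside the timestamp.

```python
import hashlib

def next_timestamp(logical_time, tree_root_hash, op_hash, history_hash):
    """Build the new timestamp after a successful update: a logical
    time, a verifier hash over the object's hash-tree root, the
    operation's hash, and the validated object-history-set hash,
    plus the operation hash itself (used later to check repairs).
    (Hypothetical encoding, not the thesis's wire format.)"""
    verifier = hashlib.md5(tree_root_hash + op_hash + history_hash).digest()
    return (logical_time, verifier, op_hash)

ts = next_timestamp(7, b"tree-root", b"op-hash", b"history-hash")
```

Because the verifier is deterministic in its inputs, any node holding the same state can recompute and check it.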
Object name uniqueness

Each object within the PMD service is given a unique object ID (OID). Likewise, each file stored by the storage service is also identified by an OID. Object IDs are stored within directory entries to uniquely identify the file or directory to which the entry is linked.
Within the PASIS storage system, OIDs are similar to the inode numbers used by traditional file systems (or the filehandles used by NFS). However, unlike in traditional file systems, OIDs are not centrally assigned. This complicates the validation performed during object creation.
In the PASIS storage system, the client is responsible for generating a 256-bit OID. The client generates a 256-bit random number that it uses as the OID. The client then performs a read history query operation on the newly generated OID. If a metadata-node hosts the OID, it returns the replica history associated with the OID; if not, the metadata-node returns a special null replica history (a history with a single timestamp of 0). The history of the parent directory object is also read.
When performing a create or a mkdir operation, the metadata-node validates the object history set to ensure that the latest complete timestamp of the OID being created is 0. If a create operation succeeds (i.e., it receives successful responses from QC + b metadata-nodes), the client is assured that the OID it generated is globally unique. If a create operation fails (i.e., is classifiable as incomplete), the metadata-node is free to accept a create operation for the same OID from a different client, since the latest complete timestamp is still 0. The null history entry remains part of the replica's history until it is pruned by a subsequent update operation that observes a completed create. Validation is similar for the repair of a create operation: 0 must be the latest complete timestamp, and the operation hash of the repair operation must match the operation hash stored within the repairable candidate's timestamp.
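The client-side OID generation and the node-side create check can be sketched as follows. The function names are hypothetical, and the check below inspects a single replica history for brevity; the real validation classifies the full object history set.

```python
import secrets

NULL_HISTORY = [0]   # a single zero timestamp marks an unused OID

def generate_oid():
    """Client side: draw a random 256-bit (32-byte) OID."""
    return secrets.token_bytes(32)

def create_allowed(replica_history):
    """Metadata-node side (simplified): a create validates only if
    the OID's latest complete timestamp is still 0, i.e. no
    completed create exists for that OID."""
    return max(replica_history) == 0

oid = generate_oid()
```

With 256 random bits, an accidental collision between honest clients is negligibly likely, and a deliberate collision is caught by the timestamp-0 check.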
To remove an OID (e.g., through an unlink or a rmdir operation), the replica history associated with the OID must be reset to the initial null value. Thus, the OID is only free once a remove operation has completed successfully.
5.5.3 PMD clients

A client library has been implemented to facilitate interfacing with the PMD service.
The library’s interface consists of the set of metadata operation service calls (with the exception of readhist, which is not exported externally). The implementation of the query and update operations follows the presentation in Section 5.3.
An NFSv3 server has been implemented that uses the client library. All NFS metadata operations have been mapped to PMD service operations. NFS data operations (file read/write) are mapped to calls within the storage service. There is a one-to-one mapping between NFS filehandles and PASIS OIDs.
Some NFS operations require multiple PMD operations. For example, there is a mismatch between the arguments of the NFS unlink operation and the PMD unlink operation. The NFS unlink operation takes a filename and a directory filehandle as arguments, while the PMD unlink operation requires an additional argument: the filename's OID. The filename's OID maps to the attributes of the file, which may be updated by the unlink (e.g., the link count would be decremented). To perform this update operation, validation must be performed over the object's history set. Thus, the OID of the filename's attributes is required to construct its object history set. Therefore, a PMD lookup is performed prior to the unlink operation. Additionally, during the PMD unlink operation, the metadata-node validates that the filename matches the OID passed in.
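The lookup-then-unlink sequence can be sketched as below. `FakePMD` is a minimal hypothetical stand-in for the PMD client library; only the two-call structure mirrors the text.

```python
def nfs_unlink(pmd, dir_fh, filename):
    """NFS unlink supplies (directory filehandle, name); the PMD
    unlink additionally needs the file's OID so the attribute
    object's history set can be validated. A PMD lookup supplies
    it first. ('pmd' is a hypothetical client-library handle.)"""
    file_oid = pmd.lookup(dir_fh, filename)
    return pmd.unlink(dir_fh, filename, file_oid)

class FakePMD:
    """Minimal stand-in for the PMD client library."""
    def __init__(self):
        self.dirs = {"dfh": {"f.txt": b"oid-1"}}
        self.links = {b"oid-1": 1}
    def lookup(self, dfh, name):
        return self.dirs[dfh][name]
    def unlink(self, dfh, name, oid):
        del self.dirs[dfh][name]       # parent directory object
        self.links[oid] -= 1           # attribute object: link count
        return self.links[oid]

pmd = FakePMD()
remaining = nfs_unlink(pmd, "dfh", "f.txt")
```

The extra round trip is the price of client-side history validation: the client cannot build the attribute object's history set without first learning its OID.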
Client history caching
To reduce the number of read history operations, object history sets are cached by the client. Every metadata operation request in the PMD service returns a replica history from each metadata-node executing the request. Histories are returned even if the request fails to execute. Since histories are cached, they may become out-of-date, or stale. A stale replica history will cause the request to fail validation at the metadata-node from which the replica history originated (see line 728 in Figure 4.8). An up-to-date replica history is returned by the metadata-node in response to receiving a stale history; thus, the client can update its cache and retry the operation.
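The refresh-and-retry loop reads roughly as follows. The `update` signature and `FakeNode` are hypothetical; the structure (histories returned on both success and failure, cache refreshed, one retry) follows the text.

```python
def cached_update(node, cache, oid, op):
    """Client-side sketch: send an update using the cached replica
    history; on a stale-history failure, refresh the cache from
    the history the node returns and retry."""
    ok, fresh_history = node.update(oid, op, cache.get(oid))
    cache[oid] = fresh_history        # histories come back either way
    if not ok:
        ok, fresh_history = node.update(oid, op, cache[oid])
        cache[oid] = fresh_history
    return ok

class FakeNode:
    """Hypothetical metadata-node that validates the client's history."""
    def __init__(self, history):
        self.history = history
    def update(self, oid, op, client_history):
        if client_history != self.history:
            return False, self.history     # stale: fail, return fresh
        self.history = self.history + [op]
        return True, self.history

node = FakeNode(history=[0])
cache = {"oid": [0, 99]}                   # deliberately stale entry
ok = cached_update(node, cache, "oid", 1)
```

The first attempt fails validation against the stale history; the retry, armed with the fresh history, succeeds.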
Retry and concurrency
Although the NFS server locks each filehandle associated with each operation at the PMD client, operations may still abort due to concurrency. Thus, operation retry is necessary.
Upon retry, new object histories must be obtained and classified. The operation is then based upon these new histories. Many different backoff and retry policies may be implemented to avoid retrying operations concurrently. This is particularly relevant in the face of repair, since repairs issued concurrently may cause livelock if they execute at metadata-nodes in an interleaved order that prevents any repair from completing successfully. This work does not focus on backoff and retry policies; however, they are discussed further in the evaluation section.
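One plausible policy, offered only as an illustration since the thesis deliberately leaves the policy open, is exponential backoff with full jitter: randomized delays desynchronize concurrent repairs instead of letting them retry in lockstep and livelock.

```python
import random

def backoff_delays(attempts, base=0.01, cap=1.0, seed=None):
    """Exponential backoff with full jitter: the delay before
    attempt a is drawn uniformly from [0, min(cap, base * 2**a)].
    (Hypothetical parameters; not evaluated in the thesis.)"""
    rng = random.Random(seed)
    return [rng.uniform(0.0, min(cap, base * (2 ** a)))
            for a in range(attempts)]

delays = backoff_delays(6, seed=42)
```

Because two clients draw independent delays, interleavings that block every repair become increasingly unlikely with each round.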
5.5.4 Storage service