Efficient, scalable consistency for highly fault-tolerant storage. Garth Goodson. August 2004. CMU-CS-04-111. Dept. of Electrical and Computer ...
The PMD service is one part of a complete system; the storage service and the application server complete it. In the case of a file server application, there is much flexibility in how the metadata objects are used to provide file services. For example, locks, access privileges, and client caching of stored files involve both the PMD service and the storage service. This subsection briefly describes the selection of a storage service.
The interface and access protocol used by the storage service are independent of the protocol that underlies the metadata service. For example, the storage service may use either a block-based (e.g., iSCSI or Fibre Channel) or an object-based protocol to access storage. However, some coordination is required in the design of the metadata objects and the interfaces provided by the storage service. For example, if objects (i.e., files in a flat namespace rather than blocks) are exported by the storage service, the metadata objects need not implement inodes or other structures to track block allocations.
Although we reject partitioning the namespace as a means to scale the metadata service, partitioning data for the storage service is reasonable. A storage system need not provide any guarantees about operations performed across multiple data objects; as such, partitioning is an appropriate technique for stored data. Partitioning allows different ﬁles to have different performance and reliability properties (e.g., /tmp need not be highly replicated).
The PASIS read–write protocol, described in Chapter 3, underlies our storage service. It provides block-granularity read/write access to objects. The R/W protocol provides strong consistency (linearizability of block read/write operations) and fault tolerance of erasure-coded data (e.g., data encoded with Rabin's information dispersal [Rabin 1989]). A PS service, implemented using the R/W protocol, can be relied upon to serialize all accesses to stored data. Such an approach is suitable for an application that controls concurrency itself. Alternatively, locks could be provided by the PMD service so that the application does not need to provide concurrency control. Our PS service implementation also uses CVFS as its backing store; storage-nodes can either run collocated with metadata-nodes or not.
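To make the erasure-coding idea concrete, the sketch below implements a minimal m-of-(m+1) scheme using a single XOR parity fragment. It is an illustration only, not Rabin's information dispersal (which supports more general thresholds), and all names in it are hypothetical.

```python
# Minimal m-of-(m+1) erasure code: m data fragments plus one XOR parity.
# Any m of the m+1 fragments suffice to reconstruct the original block.
# Hypothetical illustration -- NOT Rabin's IDA.

def encode(block: bytes, m: int) -> list:
    """Split `block` into m equal fragments and append an XOR parity fragment."""
    assert len(block) % m == 0, "pad the block to a multiple of m first"
    size = len(block) // m
    frags = [block[i * size:(i + 1) * size] for i in range(m)]
    parity = bytes(size)  # all zeros
    for f in frags:
        parity = bytes(a ^ b for a, b in zip(parity, f))
    return frags + [parity]

def decode(frags: list, m: int) -> bytes:
    """Reconstruct from m+1 slots where at most one slot is None (lost)."""
    missing = [i for i, f in enumerate(frags) if f is None]
    assert len(missing) <= 1, "single-parity scheme tolerates one loss"
    if missing and missing[0] < m:  # a data fragment was lost: rebuild via XOR
        size = len(frags[m])
        rebuilt = bytes(size)
        for f in frags:
            if f is not None:
                rebuilt = bytes(a ^ b for a, b in zip(rebuilt, f))
        frags[missing[0]] = rebuilt
    return b''.join(frags[:m])
```

Because any m of the m+1 fragments reconstruct the block, one storage-node can fail without data loss, at a space overhead of 1/m rather than full replication.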
Another option is to use the R/CW protocol. The R/CW protocol offers stronger consistency semantics in that writes not based on the most current version will be rejected. This has the nice property that the application server can implement very weak cache consistency (since writes based on stale reads will be rejected by the storage service). However, these semantics come at an increased cost in terms of N and space-efficiency.
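The reject-if-stale behavior can be sketched as a conditional write keyed on the object's version. This is a hypothetical single-node illustration of the semantics only, not the replicated R/CW protocol itself.

```python
# Hypothetical sketch of conditional-write semantics: a write is accepted
# only if it is based on the most current version of the object.

class StaleWriteError(Exception):
    pass

class VersionedObject:
    def __init__(self, data: bytes = b""):
        self.version = 0
        self.data = data

    def read(self):
        """Return the current data and the version it corresponds to."""
        return self.data, self.version

    def conditional_write(self, data: bytes, based_on: int) -> int:
        """Accept the write only if `based_on` matches the latest version."""
        if based_on != self.version:
            raise StaleWriteError(f"based on v{based_on}, current is v{self.version}")
        self.version += 1
        self.data = data
        return self.version
```

Because a write based on a stale read raises an error instead of silently overwriting newer data, a client cache can afford to be weakly consistent: the storage service catches the conflict.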
Instead, the versioning capability of the R/W protocol could be used to provide strong consistency across the entire file system through the use of immutable files. Clients that write files through the PS service could be required to update the associated metadata attribute object with the version of the latest completed write operation, thus ensuring consistency between the data and the attributes. However, this requires all data transfers to be serialized through the metadata.
5.6 Evaluation

5.6.1 Experimental setup

All experiments are performed on a rack of 30 Intel P4 2.66 GHz machines with 1 GB of memory. Each machine has two 33.6 GB Seagate Cheetah 10K RPM SCSI disk drives and an Intel Gb Ethernet NIC. The machines are connected with a 24-port Gb Ethernet switch. Each machine runs Debian testing with Linux kernel 2.4.22.
Many experiments use NFS servers as clients to the PMD service, while others communicate directly through the PMD library interface. Multiple NFS servers are able to access the same PMD service simultaneously. The NFS servers are mounted via loopback on the same machines as the NFS client. The NFS servers implement the NFSv3 protocol and use a 128 MB buffer cache. Unless otherwise specified, the buffer cache is write-through and data is expired after 10 seconds. No attributes or metadata are cached by the NFS servers.
The storage-nodes use CVFS as the backing data store. Each storage-node has a frontend that communicates with CVFS over IPC. Each CVFS instance uses a 512 MB buffer cache. All experiments show results using write-back caching at the storage-nodes, mimicking the availability of 16 MB of non-volatile RAM. This lets the experiments focus on the overheads introduced by the protocol rather than those introduced by the disk subsystem.
5.6.2 Cryptography performance
Directory objects are implemented with 4 KB blocks (the large objects optimization).
On update operations, MD5 hashes of modiﬁed blocks are taken. Such hashes take 24.4 µs to generate over a full block. However, the hash is only taken over the utilized portion of the block.
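As a rough way to reproduce this kind of measurement (absolute numbers will differ on modern hardware and with modern hash implementations), one can time MD5 over a full 4 KB block:

```python
import hashlib
import time

BLOCK_SIZE = 4096  # 4 KB directory block, as in the implementation described above
block = b"\xab" * BLOCK_SIZE

# Time many iterations to average out timer granularity.
iters = 10000
start = time.perf_counter()
for _ in range(iters):
    digest = hashlib.md5(block).digest()
elapsed = time.perf_counter() - start
print(f"MD5 over {BLOCK_SIZE} B: {elapsed / iters * 1e6:.1f} us per hash")
```

Hashing only the utilized portion of a block, as the implementation does, shrinks this cost proportionally for mostly-empty blocks.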
5.6.3 PMD micro-benchmarks

This subsection describes a number of micro-benchmarks performed against the PMD service. No file data is involved in any of these experiments; they test only the PMD service. The first set of experiments examines NFS micro-benchmarks, the second examines the impact of concurrency on PMD create operations, and the third examines the impact of a fault on the response-time distribution of a run of create operations.
PMD NFS micro-benchmarks
All the PMD operations described in Table 5.1 have been implemented. Most of the NFSv3 operations map to corresponding PMD service operations, although a few require multiple PMD operations. NFS micro-benchmarks were performed against some of the NFS metadata operations; the mean response times for these operations are listed in Table 5.2. The PMD service was configured to tolerate one Byzantine fault; therefore, six storage-nodes were used. In addition to the end-to-end response time for the PMD service, the response times as observed by the PMD storage-nodes are also listed.
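The configurations reported in this section are consistent with a storage-node count of N = 2t + 2b + 2: six nodes for one Byzantine fault (t = b = 1) here, and four nodes for t = 1, b = 0 in the concurrency experiments. The helper below simply evaluates that expression; treat it as an inference from these reported configurations, not a statement of the protocol's formal bound.

```python
def storage_nodes_needed(t: int, b: int) -> int:
    """Total faults tolerated t, of which b may be Byzantine (b <= t).
    N = 2t + 2b + 2 matches the configurations reported in this section;
    an inferred sizing rule, not the protocol's formal lower bound."""
    assert 0 <= b <= t
    return 2 * t + 2 * b + 2

print(storage_nodes_needed(1, 1))  # 6 nodes: one Byzantine fault tolerated
print(storage_nodes_needed(1, 0))  # 4 nodes: one crash fault tolerated
```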
The response time for the create operation represents a create that occurs within a directory comprised of a single block. Due to the implementation of directory objects, a linear search is performed to ensure the name being inserted into the directory does not already exist; thus, the larger the directory, the longer the create takes. Likewise, the performance of the readdir operation is also dependent upon the size of the directory, since each directory block is transmitted back to the client. Thus, two results for readdir are shown: one for a small directory containing 5 entries and one for a large directory containing 500 entries.
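The linear name search that makes create cost grow with directory size can be sketched as follows. The structures here are hypothetical simplifications; the real implementation packs entries into 4 KB directory blocks.

```python
# Hypothetical sketch of why create is O(directory size): every block of the
# directory is scanned to check that the name does not already exist.

def name_exists(directory_blocks, name: str) -> bool:
    """Linear search across all directory blocks for `name`."""
    for block in directory_blocks:          # each block holds (name, inode) entries
        for entry_name, inode in block:
            if entry_name == name:
                return True
    return False

def create(directory_blocks, name: str, inode: int):
    """Insert a new entry after verifying the name is unique."""
    if name_exists(directory_blocks, name):
        raise FileExistsError(name)
    directory_blocks[-1].append((name, inode))  # simplistic placement policy
```

The same scan explains readdir's dependence on directory size: every block must be visited (and, in the real service, shipped back to the client).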
Concurrency

The impact of concurrency is examined in the context of PMD create operations. Three graphs show the results of performing PMD create operations with varying degrees of concurrency with t = 1, b = 0, N = 4. Two clients simultaneously perform create operations within a set of shared directories. Recall that create is a multi-object operation; in this experimental setup, the parent directory is the source of concurrency. To increase the likelihood of concurrency, the set of shared directories is decreased between each run. In each run, each client randomly picks a directory to use as the parent of the create operation.
Each client has only one outstanding request at a time. Care was taken to fully overlap the execution of both clients.
The first graph, Figure 5.7(a), shows the mean response time and standard deviation as concurrency is increased. The "None" bar represents a run with sixteen directories and only one client (i.e., there is no concurrency). Unsurprisingly, as concurrency increases, both the mean response time and the standard deviation grow. As concurrency is increased, repair and barrier operations become more common, as does the number of operations that must be retried due to stale object histories. Even at high concurrency levels (two clients sharing two directories), the mean response time and standard deviation are within a factor of two or three of a run with no concurrency.
The second graph, Figure 5.7(b), shows the total number of barrier and repair operations attempted at each concurrency level. These counts are normalized to the total number of create operations performed. Again, as concurrency is increased, the number
Figure 5.7: Concurrency experiments.
These three graphs show the results of performing create operations with varying degrees of concurrency. Two clients simultaneously perform create operations into a ﬁxed set of directories. For each operation each client randomly picks a parent directory from the directory set. To increase the degree of concurrency the directory set size is decreased between each run. (a) Shows the mean response time and its standard deviation as concurrency is increased. (b) Shows the total number of barrier and repair operations performed at each concurrency level normalized to the number of create operations issued by the client. (c) Shows the total number of times an operation was rejected due to stale object histories, again normalized to the number of create operations.
of barrier and repair operations also increases. It is interesting to note that there is a large step increase from the very low concurrency configurations ("None" and "16") to the higher concurrency levels ("4" and "2").
The third graph, Figure 5.7(c), shows the total number of operations (including repairs and barriers) that are rejected due to stale object histories. Recall that the client caches replica histories returned by recent operations to construct the object history set for a subsequent update operation. In addition, every operation returns an updated replica history (even if that operation failed). Examining the steep increase in stale object histories at the very high concurrency levels, one notices that there is often a race between the two clients trying to repair the same parent directory. Both clients try to write barriers, but neither client quite succeeds in completing a barrier (since the barrier is rejected by a subset of the nodes because the other client just wrote its barrier there). This observation indicates that back-off and repair policies must be considered carefully when using optimistic protocols, which can otherwise result in livelock. More work is required in this area.
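One standard remedy for such races is randomized exponential back-off before retrying a rejected barrier or repair. The sketch below shows an assumed "full jitter" policy for illustration; it is not the thesis's implementation, and the parameter values are hypothetical.

```python
import random

def backoff_delays(base_ms: float = 1.0, cap_ms: float = 100.0, attempts: int = 6):
    """Yield randomized exponential back-off delays ("full jitter").
    Randomization de-synchronizes two clients racing to repair the same
    parent directory, reducing repeated barrier collisions and the
    resulting risk of livelock."""
    for attempt in range(attempts):
        ceiling = min(cap_ms, base_ms * (2 ** attempt))
        yield random.uniform(0, ceiling)

delays = list(backoff_delays())
print([round(d, 2) for d in delays])
```

A client would sleep for the next delay after each rejected barrier attempt, doubling the ceiling until it succeeds or gives up.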
This experiment shows the impact of a fault occurring at a random point during a run of create operations. The system was configured with t = 1, b = 0, N = 4 and a single client that continuously performed unique create operations into one of sixteen directories (picked at random). The client had only a single request outstanding, so there is no concurrency. Each run lasted 10 seconds. At a random point during each fault-induced run, one of the storage-nodes was killed. Seven fault-induced runs were performed; no correctness problems were observed in any of them. The remainder of this subsection quantifies the performance consequences of running with a failed server.
Figure 5.8 shows the mean response time and standard deviation of a single fault-free run and the accumulation of the response times from the seven fault-induced runs. The mean response time for the fault-free run is 2.17 ms, with a standard deviation of 0.34 ms. The mean response time across all fault-induced runs is 2.47 ms, with a standard deviation of 0.35 ms. Although the standard deviations of the fault-free run and the set of fault-induced runs are similar, the standard deviation of each individual fault-induced run was between 0.52 ms and 0.66 ms. In general, fault-induced runs with a higher mean response time also had a higher standard deviation, and such runs generally have a larger number of outlier response times (response times above 3 ms).
In a fault-free run, the client waits for only the fastest N − t responses. Once a failure has occurred, the client still waits for N − t responses, but there are now only N − t servers, so it is waiting for all responses rather than the fastest subset. Thus, variations in individual storage-node response times are not masked. This accounts for the observed increases in the means and standard deviations of response times within the fault-induced runs.
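The fastest N − t behavior can be sketched with a thread pool: issue the request to all N storage-nodes and return as soon as N − t responses arrive. The RPC stand-in below is hypothetical; only the quorum-collection pattern is the point.

```python
import concurrent.futures
import random
import time

def query_node(node_id: int) -> int:
    """Hypothetical stand-in for a storage-node RPC with variable latency."""
    time.sleep(random.uniform(0.001, 0.02))
    return node_id

def fastest_responses(n: int, t: int):
    """Send to all n nodes; return the first n - t responses to arrive.
    With all n nodes alive, the slowest t responses are masked; after one
    crash (n - 1 nodes left), every remaining node must answer, so each
    node's latency variation shows up in the end-to-end response time."""
    need = n - t
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(query_node, i) for i in range(n)]
        for fut in concurrent.futures.as_completed(futures):
            results.append(fut.result())
            if len(results) == need:
                break
    return results

print(fastest_responses(4, 1))  # any 3 of the 4 node ids, fastest first
```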
5.6.4 PMD service macro-benchmarks
Figure 5.8: Response time distributions of a fault-free and multiple fault-induced runs.
This figure shows the mean and standard deviation of response times for a fault-free run and multiple fault-induced runs of the create operation. At a random point during each fault-induced run, the PMD process on one of the storage-nodes is killed. As can be seen, the mean response time for the fault-induced runs is higher than in the fault-free run. Although the standard deviations are almost the same between the fault-free run and the set of all fault-induced runs, the standard deviation within a single fault-induced run is higher than in the fault-free run.
The Postmark benchmark is used to gather information about the performance of the PMD service. Postmark was designed to measure the performance of a file system used for electronic mail, netnews, and web-based services. It is comprised of two phases: (i) in the creation phase, it creates a large number of small, randomly sized files (between 512 B and 9 KB); and (ii) in the transaction phase, it performs a specified number of transactions. Each transaction consists of two sub-transactions: one is a create or delete, and the other is a read or append. Three Postmark configuration parameters are important to our experiment: files (determines the creation phase), transactions (determines the transaction phase), and directories (determines the degree of access contention). Results from Postmark experiments are given in transactions per second over both phases.
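The transaction phase described above can be sketched as follows. This is a hypothetical, heavily simplified model of Postmark's workload generator, using an in-memory dict in place of the file system; it is not Postmark itself.

```python
import random

def run_transactions(files: dict, num_txns: int, seed: int = 42) -> dict:
    """Toy model of Postmark's transaction phase: each transaction pairs a
    (create | delete) sub-transaction with a (read | append) sub-transaction.
    `files` maps name -> bytes; file sizes model the 512 B - 9 KB range."""
    rng = random.Random(seed)
    next_id = len(files)
    for _ in range(num_txns):
        # sub-transaction 1: create or delete (never delete the last file)
        if rng.random() < 0.5 or len(files) < 2:
            files[f"file{next_id}"] = bytes(rng.randrange(512, 9 * 1024))
            next_id += 1
        else:
            files.pop(rng.choice(list(files)))
        # sub-transaction 2: read or append
        name = rng.choice(list(files))
        if rng.random() < 0.5:
            _ = files[name]                      # read
        else:
            files[name] = files[name] + b"\x00"  # append
    return files
```

Run against the PMD-backed NFS service, each create and delete exercises the multi-object metadata path, which is why the directories parameter controls access contention.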
PMD service base throughput
The ﬁrst experiment determines the maximum throughput of the PMD service. The PMD