The PMD service is configured to tolerate a single benign metadata-node fault and Byzantine clients: t = 1, b = 0, and N = 4. Postmark is run on an increasing number of NFS servers to determine the maximum throughput of the PMD service in this configuration. To ensure that the PMD service is as loaded as possible, the NFS server uses local storage for the storage service. Postmark is designed to benchmark a single NFS server; however, it is being used here to benchmark a decentralized service behind an NFS interface. As such, the number of files, transactions, and directories for each Postmark/NFS server is scaled down as the number of Postmark/NFS server pairs is scaled up. This is done to maintain a consistent working set across runs. Each NFS server runs Postmark in a different directory of the PMD service. The working set fits within the cache on the metadata-nodes.
Figure 5.9 shows the throughput of the PMD service with up to 16 distinct NFS servers.
For a single NFS server, Postmark is configured for 32768 transactions, 1024 files, and 64 directories. Each NFS server has a single Postmark benchmark run against it. The Postmark configuration is scaled down as the number of clients is scaled up, thus keeping the working set size the same. For example, with 16 NFS servers, Postmark scales down to 2048 transactions, 64 files, and 4 directories per server. The PMD service saturates just below 350 transactions per second.
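The scaling rule is simple division by the server count; the following sketch (a hypothetical helper, not part of the actual benchmark harness) reproduces the per-server configurations used in this experiment:

```python
# Postmark parameters are divided by the number of NFS servers so that
# the aggregate working set stays constant across runs.
BASE_CONFIG = {"transactions": 32768, "files": 1024, "directories": 64}

def postmark_config(num_servers):
    """Per-server Postmark parameters for a given number of NFS servers."""
    return {k: v // num_servers for k, v in BASE_CONFIG.items()}

for n in (1, 2, 4, 8, 16):
    print(n, postmark_config(n))
# 16 servers -> 2048 transactions, 64 files, 4 directories per server,
# matching the configuration described above.
```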
Figure 5.9: Postmark throughput vs. client load. This graph compares the total system throughput of a Postmark workload as the number of NFS servers (PMD clients) increases in fault-free operation. Each client runs a single instance of Postmark against an NFS server mounted via loopback on the same machine. The two sets of bars represent configurations with b = 0, t = 1, N = 4 and with b = 1, t = 1, N = 6.
In the next experiment, the number of failures tolerated is scaled from b = t = 1 to b = t = 3. Figure 5.10 demonstrates that as the number of failures tolerated scales, the responsiveness of the PMD service is fairly flat for the benign configuration and degrades only moderately for the Byzantine configuration. This degradation is expected, for a number of reasons. First, more cryptography is being performed by metadata-nodes (e.g., for b = 3, authenticators comprise 16 entries, since N = 5b + 1). Second, all storage-nodes are being communicated with; thus, the communication costs grow as N increases. Additionally, this experiment shows that the performance cost of a fully Byzantine-tolerant system is not prohibitive (at least for low values of b).
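To make the first cost concrete: below is a minimal sketch of authenticator generation, assuming (as in Chapter 4) that an authenticator is a vector of HMACs, one per storage-node, computed with pairwise-shared keys. The key material and hash function here are illustrative placeholders:

```python
import hashlib
import hmac

def num_nodes(b):
    """Metadata-nodes required to tolerate b Byzantine faults (N = 5b + 1)."""
    return 5 * b + 1

def make_authenticator(message, pairwise_keys):
    """One HMAC entry per node: cryptographic cost grows linearly with N."""
    return [hmac.new(key, message, hashlib.sha1).digest()
            for key in pairwise_keys]

for b in (1, 2, 3):
    n = num_nodes(b)
    keys = [bytes([i]) * 20 for i in range(n)]  # placeholder pairwise keys
    auth = make_authenticator(b"logical timestamp", keys)
    print(f"b={b}: N={n}, authenticator entries={len(auth)}")
# b=3 yields N=16, so each authenticator carries 16 HMAC entries.
```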
Figure 5.10: Postmark run time vs. total failures tolerated (t). This graph compares the run time of a Postmark workload as the number of tolerated faults (t) is scaled upward. One line represents a wholly crash environment (b = 0), while the other represents a wholly Byzantine environment (b = t).
Scaling throughput using threshold quorums
Recall that Section 4.5 describes techniques that can be used to scale the system's throughput by adding storage-nodes. For example, the smallest configuration with t = 1, b = 1 is N = 6, COMPLETE = 5, and INCOMPLETE = 3. It is possible to increase the bounds on the R/CW constraints in the following way: for every 3∆ added to N, 2∆ must be added to QC, COMPLETE, and INCOMPLETE. Thus, as ∆ increases, the load of the system approaches a lower bound of 2/3.
Table 5.3: Threshold quorum experiment parameters (b = t = 1). This table shows the derived parameters when using threshold quorums. The first column shows ∆. The second column shows the value of N. The third column shows the size of each threshold quorum (|Quorum| = QC + b). The fourth column gives the quorum load of each storage-node (i.e., the fraction of operations for which a storage-node must execute a request). Lastly, the fifth column shows the calculated system throughput normalized to the throughput of ∆ = 0.
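The table's derived columns follow mechanically from the scaling rule above. Below is a minimal sketch, assuming (per the table's definitions) that load is |Quorum|/N and that system throughput is proportional to the inverse of the load:

```python
# Threshold-quorum parameters for b = t = 1: the base configuration has
# N = 6 and |Quorum| = 5; each unit of delta adds 3 to N and 2 to the
# quorum thresholds.
def derived_params(delta, base_n=6, base_quorum=5):
    n = base_n + 3 * delta
    quorum = base_quorum + 2 * delta
    load = quorum / n        # fraction of operations each node executes
    return n, quorum, load

base_load = derived_params(0)[2]
for delta in range(4):
    n, quorum, load = derived_params(delta)
    print(f"delta={delta}: N={n}, |Quorum|={quorum}, load={load:.3f}, "
          f"normalized throughput={base_load / load:.2f}")
# As delta grows, the load tends toward 2/3 and the normalized
# throughput toward (5/6)/(2/3) = 1.25.
```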
Figure 5.11 shows the throughput of the PMD service when using a threshold quorum construction, as ∆ increases.
The throughput is normalized to the throughput obtained at ∆ = 0. Two curves are plotted on the graph. The first line shows the calculated throughput (see column 5 in Table 5.3). The second line shows the maximum throughput attained by running a heavyweight synthetic update operation containing a 4 KB argument and a 10 ms storage-node think time.
To measure the throughput, many clients, each with multiple outstanding queries or updates, are employed. The reported throughput measurements are for a saturated system (i.e., adding more clients does not increase throughput). We employ an access strategy based on a deterministic function of the object ID. Such an access strategy results in updates for a given object preferentially accessing the same quorum (i.e., the preferred quorum).
Figure 5.11: Throughput of threshold quorum system vs. ∆. This graph shows the normalized system throughput as ∆ is increased. Throughput is normalized to that of ∆ = 0. See Table 5.3 for a description of the relationship between ∆, N, and throughput. Two curves are shown. The first line shows the calculated throughput as described in Table 5.3. The second line shows the throughput attained when running a 4 KB update operation.
Accessing a service that is implemented by an ensemble of Q/U objects, via each object's preferred quorum, approximates a traditional quorum access strategy. As can be seen, the synthetic update line closely follows the calculated throughput curve.
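A deterministic object-to-quorum mapping of this kind might look like the following sketch; the hash choice and the contiguous quorum layout are illustrative assumptions, not the implementation's:

```python
import hashlib

def preferred_quorum(object_id, n, quorum_size):
    """Map an object ID to a fixed set of storage-nodes, so repeated
    accesses to the same object hit the same (preferred) quorum."""
    digest = hashlib.sha1(object_id).digest()
    start = int.from_bytes(digest[:4], "big") % n
    return [(start + i) % n for i in range(quorum_size)]

# With N = 9 and |Quorum| = 7 (delta = 1), every access to this object
# lands on the same 7 of the 9 nodes, while distinct objects spread
# their preferred quorums across the node set.
print(preferred_quorum(b"object-42", n=9, quorum_size=7))
```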
An additional experiment was run using a recursive threshold construction [Malkhi et al. 1997]. For the recursive threshold construction (with t = b), N = (5b + 1)^(∆+1) and |Quorum| = (4b + 1)^(∆+1) (i.e., ∆ indicates the recursion depth). With t = b = 1 (∆ = 1, N = 36, |Quorum| = 25), the achieved throughput, normalized to ∆ = 0, was 1.19 versus 1.20 for the normalized theoretical throughput.
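A small extension of the earlier sketch, under the same load assumption, checks the theoretical figure:

```python
# Recursive threshold construction (t = b): N = (5b + 1)**(delta + 1),
# |Quorum| = (4b + 1)**(delta + 1), where delta is the recursion depth.
def recursive_params(delta, b=1):
    n = (5 * b + 1) ** (delta + 1)
    quorum = (4 * b + 1) ** (delta + 1)
    return n, quorum, quorum / n

base_load = recursive_params(0)[2]            # 5/6 at delta = 0
n, quorum, load = recursive_params(1)         # N = 36, |Quorum| = 25
print(n, quorum, round(base_load / load, 2))  # -> 36 25 1.2
```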
5.6.5 PASIS file system: SSH-build
This benchmark exercises the PMD service operating in unison with the PS service. It consists of three phases: unpacking the OpenSSH archive, running configure, and compiling the OpenSSH binaries. The unpack phase stresses metadata operations on files of varying sizes by uncompressing and untarring the OpenSSH (v3.8p1) tar archive. The configure phase consists of the automatic generation of header files and Makefiles, which involves building various small programs that check the existing system configuration. The build phase compiles, links, and removes temporary files. This last phase is the most CPU intensive, but it also generates a large number of object files and a few executables.
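The three phases map directly onto a simple timing harness; the sketch below is hypothetical (only the tarball version comes from the text), but shows the structure of the benchmark:

```python
import subprocess
import time

def timed_phase(name, command, cwd=None):
    """Run one SSH-build phase and report its wall-clock run time."""
    start = time.time()
    subprocess.run(command, shell=True, cwd=cwd, check=True)
    print(f"{name}: {time.time() - start:.1f} s")

# Run inside a directory of the NFS-mounted PASIS file system.
timed_phase("unpack",    "tar xzf openssh-3.8p1.tar.gz")
timed_phase("configure", "./configure", cwd="openssh-3.8p1")
timed_phase("build",     "make",        cwd="openssh-3.8p1")
```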
Figure 5.12 shows the run time of the SSH-build benchmark for four configurations. None of the NFS configurations uses the synchronous mount option; however, all NFS configurations use a 128 MB write-back cache with no data expiration time. The first set of bars shows a user-level NFSv3 server that stores files in the local ext3 file system.
The second set of bars shows the performance of the same user-level NFSv3 server just described; however, the data is stored across the network on a single CVFS storage-node.
This shows the overhead of using CVFS across an additional network link. However, in this configuration metadata is treated as data (i.e., attributes and directory entries are stored within data blocks). The third set of bars shows the cost of using the PASIS metadata service (with t = b = 1, N = 6) while storing file data in the local file system. The fourth, and last, set of bars shows the SSH-build benchmark run against the complete PASIS metadata and storage service. Both the metadata and storage services are configured with t = b = 1;
N = 6 for the metadata service and N = 5 for the storage service. As can be seen, there is almost a 2x performance difference between a user-level NFS server with no fault tolerance and an NFS server backed by a Byzantine fault-tolerant metadata and storage service that provides strong consistency. The majority of the overhead is due to the extra communication required (11 storage-nodes in the PMD+PS case vs. 1 without), as well as a non-zero cost in our CVFS and storage-node implementations.
This chapter describes the PASIS metadata (PMD) service. It uses a novel quorum-style query/update (Q/U) protocol to provide horizontal scalability for metadata, as is enjoyed for data in scalable storage systems. The PMD service extends the read/conditional-write protocol, described in Chapter 4, to support more general query and update operations.
These operations provide access to objects at a finer granularity than do block-based protocols (e.g., reading/inserting directory entries vs. reading/writing full directories). In addition, atomic updates across multiple objects are supported.
Similar to the other protocols developed thus far, the Q/U protocol uses optimism and versioning to achieve efficiency while tolerating asynchronous communication and Byzantine failures of clients and servers. Experiments with a decentralized NFS file service demonstrate feasibility and efficiency. As well, performance under concurrency and faults is examined. Experiments also show that threshold quorum constructions can be used to significantly increase throughput without requiring the partitioning of the metadata service.
6 Conclusions and Future Work
This thesis has demonstrated a novel approach to achieving scalable, highly fault-tolerant storage systems by leveraging a set of efficient and scalable strong-consistency protocols enabled by storage-node versioning. These consistency protocols achieve efficiency and scalability via a combination of optimistic operation, versioning, and quorum-style redundancy.
Three consistency protocols have been developed that offer varying semantics useful for building different components within a survivable, decentralized storage system.
The first protocol, the read/write (R/W) protocol, provides read–write semantics of full data blocks. This protocol is suitable as the basis for the data storage component within a survivable storage system, since most block-based data services expect whole-block updates.
The second protocol, the read/conditional-write (R/CW) protocol, provides read–modify–write semantics of full data blocks. While this protocol also assumes blocks (or data objects) are read and written as atomic units, it offers stronger consistency guarantees. These semantics guarantee that the data region has not been modified between a read and a successive write operation to the same data region.
The third protocol, the query/update (Q/U) protocol, extends the R/CW protocol to more fully support the semantics required by metadata. In order to preserve the consistency of metadata, metadata objects (e.g., directories) require update operations that modify existing contents (such as inserting a new directory entry), rather than overwriting their previous contents. As well, metadata usually requires atomic update operations across multiple metadata objects (e.g., when performing a rename, or moving files). The Q/U protocol provides for the serializability of multiple, arbitrary operations through the use of replicated state machines.
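The contrast between the three protocols can be summarized as client-visible interfaces. The sketch below is purely illustrative; the names and signatures are assumptions for exposition, not the dissertation's actual API:

```python
class ReadWrite:
    """R/W protocol: read-write semantics over whole data blocks."""
    def read(self, block_id): ...
    def write(self, block_id, data): ...

class ReadConditionalWrite(ReadWrite):
    """R/CW protocol: read-modify-write semantics; a write succeeds only
    if the block is unmodified since the version the client read."""
    def conditional_write(self, block_id, data, observed_version): ...

class QueryUpdate:
    """Q/U protocol: arbitrary query/update operations over one or more
    objects, e.g., inserting a single directory entry, or an atomic
    rename that spans two directory objects."""
    def query(self, object_ids, operation): ...
    def update(self, object_ids, operation): ...
```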
These protocols were developed in detail, evaluated individually, and used as a basis for building a fault-tolerant, scalable storage system. Results show that the PASIS file system configured to tolerate one Byzantine fault is within a factor of two in response time, when unpacking, configuring, and building OpenSSH, of an unreplicated user-level NFS server. The storage service component of the file system, using the R/W protocol, was shown to scale well in terms of both throughput and response time as the number of faults tolerated is scaled up. As well, it performs well when compared to a Byzantine fault-tolerant agreement protocol (BFT), and offloading work from storage-nodes to clients increases its scalability. Results also show that the PASIS metadata service, using the query/update protocol, scales as the number of clients is increased, and its response time increases only slightly as the number of faults tolerated is scaled up. Additionally, the use of threshold quorums enables the system's throughput to scale close to its theoretical bounds, and it is expected that other quorum constructions can further increase the system's scalability.
6.2 Contributions
The main contribution of this thesis is the design and evaluation of three consistency protocols that have been enabled by versioning storage-nodes. These contributions are:
(1) The development and demonstration of a read/write block storage consistency protocol that enables highly fault-tolerant storage through the use of erasure-coded data and versioning storage-nodes. Its correctness is shown through proof sketches.
(2) The development of the read/conditional-write block storage protocol and an exploration of its design space: trade-offs between tolerating Byzantine clients and erasure coding, as well as trade-offs between tolerating Byzantine storage-nodes and liveness, have been discussed.
(3) The extension of the read/conditional-write block protocol (in the query/update protocol) to support operations on multiple, arbitrary objects, and the implementation of a scalable metadata service based upon the query/update and read/write protocols.
(4) The evaluation of a distributed file system that utilizes the scalability and fault tolerance of the developed consistency protocols, in terms of the number of faults tolerated, the maximum throughput the system can sustain, and its performance in degraded operation modes (i.e., with concurrency and faults).
6.3 Future directions