Efficient, scalable consistency for highly fault-tolerant storage
Garth Goodson
August 2004
CMU-CS-04-111
Dept. of Electrical and Computer Engineering, Carnegie Mellon University
Second, we assume that comprehensive object versioning at each storage-node is efficient. Previous studies have shown that versioning nodes can offer performance typically within 10% of that of a non-versioning node [Strunk et al. 2000]. As well, modern disks have the capacity required to version objects comprehensively [Strunk et al. 2000; Soules et al. 2003].
Third, we assume that objects exported through the protocols, designed properly, will experience low access concurrency. Most file system studies conclude that file sharing is rare. Additionally, our R/CW objects support conditional write operations that update multiple objects atomically. This, in turn, permits us to utilize fine-grained metadata objects, which reduces access concurrency for these objects. Thus, a separate attribute object can be maintained for each file, rather than including file attributes in directory objects.
There are a number of system model assumptions that hold for all protocols developed.
The system model is more formally described in Section 3.1, but can be summarized as follows.
Each data-item is hosted by a static number of storage-nodes; i.e., once the data-item has been created, the set of storage-nodes on which that data-item can exist is ﬁxed. There are an arbitrary number of clients in the system. Both storage-nodes and clients may suffer Byzantine faults [Lamport et al. 1982].
All protocols are developed within an asynchronous model of time (i.e., no assumptions are made about message transmission delays or execution rates). Channels are assumed to be point-to-point, authenticated, and to adhere to the finite duplication and fair loss properties [Aguilera et al. 2000]; see Section 3.1 for a complete description of the system model.
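To give a flavor of how such a fault model translates into replication requirements, the sketch below computes quorum sizes under the classic Byzantine masking-quorum bound (n ≥ 4b + 1 servers to tolerate b Byzantine faults, with quorums of size ⌈(n + 2b + 1)/2⌉). This is an illustrative textbook bound only; the actual storage-node constraints for the R/W and R/CW protocols are derived in Chapters 3 and 4 and differ from it.

```python
import math

def masking_quorum_size(n: int, b: int) -> int:
    """Smallest quorum size for a Byzantine masking quorum system
    tolerating b faulty servers out of n (requires n >= 4b + 1)."""
    if n < 4 * b + 1:
        raise ValueError("need n >= 4b + 1 servers")
    return math.ceil((n + 2 * b + 1) / 2)

def min_correct_in_intersection(n: int, b: int) -> int:
    """Any two quorums overlap in >= 2b + 1 servers, of which at
    least b + 1 are correct -- enough to out-vote the faulty ones."""
    q = masking_quorum_size(n, b)
    return 2 * q - n - b

# Tolerating b = 1 Byzantine storage-node needs n = 5 nodes and
# quorums of 4; any two quorums share at least 2 correct nodes.
```

The key property is that the correct servers common to any pair of quorums outnumber the faulty ones, so a reader can identify the legitimate value by voting.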
1.3.3 Applying the protocols to the PASIS storage system
The R/W protocol underlies the PS service. It provides block granularity read/write access to data objects. Data objects are variable length data containers named by a unique object identiﬁer. The R/W protocol allows for the use of space-efﬁcient data encodings.
To demonstrate that our protocol is efficient in practice, we compare its performance to BFT [Castro and Liskov 2001; 2002], the Byzantine fault-tolerant replicated state machine implementation that Castro and Liskov have made available [Castro and Rodrigues 2003]. Experiments show that the PS service scales better than BFT in terms of network utilization at the server and in terms of work performed by the server. Experiments also show that the response times of PASIS and BFT are comparable. Additionally, experiments show that the response times of the PASIS R/W prototype remain flat as the number of faults tolerated is scaled up.
Two types of metadata objects are implemented: attributes and directories. Attribute objects exist for both directory objects and files. The attributes map directly to typical UNIX file permissions. Directory objects hold multiple directory entries. Each directory entry stores the name and access information for a file or directory stored within the storage system. The access information specifies how the named object can be accessed.
If the named object is a ﬁle, the access information is speciﬁc to the PS service implementation (e.g., where the ﬁle is located, the encoding of the ﬁle, etc.).
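To make the split between the two metadata object types concrete, a minimal sketch might look as follows. The field names here are purely illustrative, not the thesis's actual object layout; the `location` field stands in for the PS-service-specific access information described above.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Attributes:
    """Per-file attribute object, mapping to UNIX-style metadata."""
    owner: int
    group: int
    mode: int          # e.g., 0o644 permission bits
    size: int = 0

@dataclass
class DirEntry:
    """Access information for one name in a directory.  For a file,
    'location' is PS-service specific (placement, encoding, etc.)."""
    object_id: int
    is_dir: bool
    location: str = ""

@dataclass
class Directory:
    """Directory object holding multiple directory entries."""
    entries: Dict[str, DirEntry] = field(default_factory=dict)

    def lookup(self, name: str) -> DirEntry:
        return self.entries[name]

d = Directory()
d.entries["readme.txt"] = DirEntry(object_id=42, is_dir=False)
```

Keeping `Attributes` as a separate object per file, rather than embedding it in `Directory`, is exactly the fine-grained-object choice that reduces access concurrency.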
The PMD service is evaluated in the context of a complete file system implemented as an NFS server. It can either use the PS service to store data or be configured to store data in its local file system. When storing data locally, experiments show that the PMD service's throughput scales as load (number of NFS servers) is increased and that response time increases only gradually as the number of faults tolerated is scaled up.
As well, experiments show that the performance degrades gracefully when concurrency is introduced, even at very high concurrency levels. Finally, when the PS service is used in conjunction with the PMD service in a conﬁguration capable of tolerating a single Byzantine fault, the run time of an OpenSSH build is within a factor of two of a non-fault tolerant user-level NFS server.
The remainder of this thesis is organized as follows. Chapter 2 describes background and related work. It is broken into a discussion of atomic read/write objects (or registers), which pertains to block-based storage, and a discussion of systems and protocols capable of providing consistency and fault-tolerance for operations performed on arbitrary objects. Chapter 3 develops the R/W protocol for block-based storage. The system model, constraints on the number of storage-nodes, and the implementation and evaluation of the protocol are described. Chapter 4 describes the R/CW protocol for block-based storage. The protocol is developed similarly to the R/W protocol. Chapter 5 extends the R/CW protocol to provide consistency for operations performed over arbitrary objects (i.e., the Q/U protocol). As well, the chapter describes the design and implementation of the PASIS storage system, which utilizes both the Q/U protocol and the R/W protocol to provide strong consistency, fault-tolerance, and scalability to its clients. The storage system is then evaluated as a distributed NFSv3 storage system. The last chapter, Chapter 6, concludes and provides future directions for this work. Finally, a set of appendices provides proofs of safety for the consistency protocols developed within.
2 Background and Related Work
This chapter describes background and related work pertaining to the construction of scalable and fault-tolerant distributed storage systems. First, the components that comprise a storage system are described. Second, data encoding schemes that can be used to improve space-efficiency are introduced. Third, consistency semantics and protocols for tolerating benign and Byzantine faults are described. Fourth, and lastly, work related to the scalability of metadata services is discussed.
2.1 Storage system overview
Traditionally, disk-based storage systems have been built around a centralized monolithic disk array or mainframe. While these systems have been shown to provide good reliability and performance, they have a number of weaknesses. First, the hardware is highly customized and very expensive to build. Second, these systems are hard to scale to very large sizes. Third, the range of faults they are able to handle is limited (e.g., benign single, or possibly double, disk failures).
This thesis describes protocols that can be used to build a Byzantine fault-tolerant, decentralized storage architecture to help solve these problems. First, by tolerating Byzantine faults, cheaper, off-the-shelf components can be used, since hardware and software bugs can be masked by the fault-tolerance provided by the underlying storage protocols.
Second, these systems are more scalable, in that the addition of new storage-nodes yields improvements in the capacity, throughput, or fault-tolerance of the service. Third, fault-tolerance is gained by designing the storage protocols to withstand arbitrary (Byzantine) failures of clients and a limited number of storage-nodes, and by requiring no timing (synchrony) assumptions for correctness. However, in this type of architecture there is no centralized control, making it difficult to provide consistency in the face of faults and concurrency.
2.1.1 File service
This work focuses on developing protocols that can be used to construct a decentralized, fault-tolerant, file-based storage system. Traditional file systems consist of both metadata and data services. The data service is responsible for storing file data, while the metadata service stores data about how and where the file data is stored (e.g., block pointers within inodes), as well as other metadata that describes the file (e.g., attributes, access control information, etc.). Metadata is often stored within the data service and is accessed by recursing through a set of structures rooted at a well-known location.
In these systems, fault-tolerance for both the data and metadata can be obtained by distributing the data service in a fault-tolerant manner. Frangipani [Thekkath et al. 1997] is an example of this type of system. It is a distributed ﬁle system that is built above a virtual disk interface exported by Petal [Lee and Thekkath 1996] and a distributed lock service. Petal can tolerate one or more disk or storage-node failures, as long as the majority of the storage-nodes are up and communicating, and as long as at least one replica of each data-item remains.
Other systems explicitly separate the metadata service from the data service. For example, NASD [Gibson et al. 1998] demonstrated that, by separating metadata access from data access, greater scalability could be achieved at lower cost. Instead of forcing all operations through a centralized file server, NASD eliminated the file server from the data flow path by allowing clients to directly access the data storage-nodes. To increase fault-tolerance, the centralized metadata server can be distributed as a fault-tolerant service. For example, Farsite [Adya et al. 2002] utilizes a Byzantine fault-tolerant agreement protocol (BFT [Castro and Liskov 1998a]) to protect the integrity of its metadata, while allowing file data to be stored on a user's desktop machine.
Consistency semantics can differ for data versus metadata. Most block-based data services, disk drives being the most common, expect whole-block updates (i.e., an entire block is always overwritten). On the other hand, metadata services often allow arbitrary data regions to be updated independently (e.g., a single directory entry may be altered within a directory).
For block updates, it is sufficient to support read–write update semantics. Read–write update semantics make no guarantees about the value of a data block between the time the block is read and later written. These semantics are sufficient for block stores, since consistency is guaranteed at the block level and blocks are usually read and written as atomic units. The PASIS read–write (R/W) protocol, described in Chapter 3, provides the consistency semantics required for block-based storage.
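These whole-block read–write semantics can be sketched as follows. This is a toy, single-node illustration only: each write blindly replaces an entire block, and nothing is guaranteed about writes that interleave between a read and a later write.

```python
BLOCK_SIZE = 512

class BlockStore:
    """Toy block store with read-write update semantics: writes
    always overwrite a whole block; there is no read-modify-write
    guarantee between a read and a subsequent write."""

    def __init__(self):
        self.blocks = {}

    def write(self, block_num: int, data: bytes) -> None:
        if len(data) != BLOCK_SIZE:
            raise ValueError("whole-block updates only")
        self.blocks[block_num] = data   # blind overwrite

    def read(self, block_num: int) -> bytes:
        # Unwritten blocks read as zeros.
        return self.blocks.get(block_num, b"\x00" * BLOCK_SIZE)

store = BlockStore()
store.write(7, b"a" * BLOCK_SIZE)
```

Because every update replaces a block wholesale, a block store never needs to merge concurrent partial updates, which is precisely why read–write semantics suffice at this level.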
In order to support consistent updates to metadata, metadata objects (e.g., directories) require update operations that modify their existing contents, rather than blindly overwriting their previous contents; otherwise, their integrity may not be preserved. Read–modify–write semantics guarantee that the data region has not been modified between a read and a successive write operation to the same data region. It is also necessary to support atomic updates across multiple objects (e.g., when renaming or moving files from one directory to another). Metadata services are often built upon protocols that provide consistent access to objects that can be manipulated through arbitrary operations (i.e., not just read and write operations). In the PASIS metadata service, the underlying read–conditional write (R/CW) protocol is described in Chapter 4, while the query/update (Q/U) protocol, which extends the R/CW protocol to provide replicated state-machine semantics, and the metadata service itself are described in Chapter 5.
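The distinction from blind overwriting can be sketched as a compare-and-swap on an object version: a conditional write names the version it read and succeeds only if no update intervened. This is a deliberate simplification of the R/CW protocol's actual mechanism (which uses logical timestamps across replicated storage-nodes, and supports multi-object atomicity, both omitted here).

```python
class VersionedObject:
    """Toy read-conditional-write object: a read returns the data
    with a version; a write names the version it read and is
    rejected if a concurrent update has intervened."""

    def __init__(self, data: bytes = b""):
        self.data = data
        self.version = 0

    def read(self):
        return self.data, self.version

    def conditional_write(self, new_data: bytes, read_version: int) -> bool:
        if read_version != self.version:
            return False          # object changed since the read: abort
        self.data = new_data
        self.version += 1
        return True

obj = VersionedObject(b"old")
data, v = obj.read()
ok = obj.conditional_write(b"new", v)     # succeeds: version unchanged
stale = obj.conditional_write(b"x", v)    # fails: version has advanced
```

A client whose conditional write is rejected re-reads the object and retries, which is how read–modify–write semantics preserve the integrity of structures such as directories.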
This thesis describes a set of protocols that provide the consistency necessary to implement fault-tolerant data and metadata services.
2.1.2 Storage-system goals
One central goal in the design of storage systems is to simultaneously provide efficiency, scalability, and fault-tolerance. Current storage systems, and their underlying protocols, fall short in one or more of the following areas:
– High fault-tolerance: To provide access to data in the event of multiple client and/or server failures (both crash and Byzantine faults), as opposed to tolerating only a single failure, as most other distributed storage systems do. First, data must be spread redundantly across the set of storage-nodes.
Second, no central points of failure should exist. This can be achieved by using decentralized consistency protocols with no single points of failure.
– Strong consistency: To provide strong consistency in the face of failures (of clients or servers) and concurrent operations (e.g., read-write concurrency, write-write concurrency). In decentralized storage systems, where data is spread across multiple storage-nodes, it is usually important to ensure that readers and writers always see a consistent view of data, especially in the face of concurrency and failures.
Although strong consistency is a goal for our storage systems, not all applications require it. As well, the consistency semantics required of block-level storage and of metadata differ. At the metadata level, it is important to offer consistency for metadata operations that may span multiple objects.
– Efficiency and scalability: While designing for the worst case generally provides support for many system and failure model assumptions, efficiency and scalability are always limited to those of the worst-case environment.
2.2 Data encodings
A common data distribution scheme used in distributed storage systems is replication, in which a writer stores a replica of the new data-item value at each storage-node to which it sends a write request. Since each storage-node has a complete instance of the data-item, the main difficulty is identifying and retaining the most recent instance. It is often necessary for a reader to contact multiple storage-nodes to ensure that it sees the most recent instance. Examples of distributed storage systems that use this design include Harp [Liskov et al. 1991], Petal [Lee and Thekkath 1996], BFS [Castro and Liskov 1998a], and Farsite [Adya et al. 2002].
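The reader's task under replication can be sketched as follows: collect timestamped replies from several storage-nodes and return the value carried by the highest timestamp. This is a deliberately simplified, crash-fault-only illustration; tolerating Byzantine storage-nodes additionally requires cross-checking replies against one another, as developed in Chapter 3.

```python
def read_latest(replies):
    """Given (timestamp, value) replies from a set of storage-nodes,
    return the value carried by the highest timestamp.  Assumes
    benign (non-Byzantine) nodes: any reply can be trusted."""
    if not replies:
        raise ValueError("no storage-node replied")
    ts, value = max(replies, key=lambda r: r[0])
    return value

# Three replicas reply; one holds a stale copy of the data-item.
replies = [(3, b"v3"), (2, b"v2"), (3, b"v3")]
latest = read_latest(replies)
```

Contacting enough nodes that at least one reply is guaranteed to carry the latest complete write is exactly the role quorum intersection plays in these protocols.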