Efficient, scalable consistency for highly fault-tolerant storage
Garth Goodson, August 2004, CMU-CS-04-111, Dept. of Electrical and Computer ...
It should be noted that, even after a write completes, it may be classiﬁed as repairable by a subsequent read, but it will never be classiﬁed as incomplete. For example, this could occur if the read set (of N − t storage-nodes) does not fully encompass the write set (of N − t storage-nodes).
To classify a candidate as INCOMPLETE a client must determine that a complete write does not exist in the system (i.e., fewer than QC benign storage-nodes host the write). For this to be the case, the client must have queried all possible storage-nodes (N − t), and must assume that the t nodes not queried host the candidate in consideration. So,

|CandidateSet| + t < QC. (3.2)
3.4.2 Real repairable candidates

To ensure that Byzantine storage-nodes cannot fabricate a repairable candidate, a candidate set of size b must be classifiable as incomplete. Substituting b into (3.2):

b + t < QC.
3.4.3 Decodable repairable candidates

Any repairable candidate must be decodable. The lower bound on candidate sets that are repairable follows from (3.2) (since the upper bound on classifying a candidate as incomplete coincides with the lower bound on repairable):

QC − t ≤ |CandidateSet|.

Since up to b members of such a candidate set may be Byzantine, decodability additionally requires that the m data-fragments needed to decode come from benign storage-nodes.
Since slow storage-nodes cannot be differentiated from crashed storage-nodes, only N − t responses can be awaited. As well, b of the responses received may be from Byzantine storage-nodes.
3.4.5 Constraint summary

The summary of constraints is given in Table 3.1. The bound on N (i.e., N ≥ 2t + 2b + 1) has been shown to be optimal for systems with single round-trip write operations [Abraham et al. 2004]. A diagram that more intuitively shows the constraint on N is shown in Figure 3.7.
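The constraints collected in this section can be expressed as a small checker. This is only a sketch under the definitions used here (t total faults tolerated, b of them Byzantine, m fragments needed to decode, QC the complete-write threshold); the function name and check labels are illustrative, not part of PASIS.

```python
def check_constraints(N, t, b, m, QC):
    """Sketch of the Section 3.4 constraint summary.

    Returns (all_ok, per-check results)."""
    checks = {
        # Bound on N, optimal for single round-trip writes [Abraham et al. 2004]
        "N >= 2t + 2b + 1": N >= 2 * t + 2 * b + 1,
        # A candidate set of size b must classify as incomplete (3.2 with |CS| = b)
        "QC > b + t": QC > b + t,
        # A repairable candidate set (size >= QC - t, up to b Byzantine) must decode
        "QC - t - b >= m": QC - t - b >= m,
    }
    return all(checks.values()), checks
```

For example, the b = t = 1, N = 5, m = 2 configuration used later in this chapter satisfies all three checks with QC = 4, while shrinking N to 4 violates the bound on N.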
3.5 Implementation PASIS consists of clients and storage-nodes. Storage-nodes store data-fragments and their versions. Clients execute the protocol to read and write data-items.
3.5.1 Storage-node implementation
Figure 3.7: Illustration of constraint on N.
This ﬁgure shows the intuitive reasoning behind the constraint on N = 2t + 2b + 1. As shown, a write executes at any N − t storage-nodes, and a read executes at a set of N − t storage-nodes that has the minimum overlap with the write. The read only observes the data value on the storage-nodes within the intersection of the read and write.
Since b of these storage-nodes may be Byzantine, b + 1 matching values must be observed. This leads to an intersection of size 2b + 1. Thus, t + (2b + 1) + t = N = 2t + 2b + 1.
ganization to reduce the cost of data versioning. Experience indicates that retaining every version and performing local garbage collection comes with minimal performance cost (a few percent) and that it is feasible to retain complete version histories for several days [Soules et al. 2003; Strunk et al. 2000].
We extended CVFS to provide an interface for retrieving the logical timestamp of a data-fragment. Implicitly, each write request creates a new version of the data-fragment (indexed by its logical timestamp) at the storage-node. In addition to data, each write request contains a cross checksum, a logical timestamp, and a linkage record [Amiri et al.
1999]. The linkage record consists of descriptions of the encoding scheme and the addresses of the N storage-nodes for a specific data-item; it is fixed upon data-item creation.
By default, a read request returns the most current data-fragment version, ordered by logical timestamp. Read responses may also contain a limited version history containing logical timestamps of previously executed write requests. The version history allows clients to identify and classify additional candidates without issuing extra read requests.
Storage-nodes can also return read responses that contain no data other than version histories, which makes candidate classification more network-efficient.
3.5.2 Garbage collection
Pruning old versions, or garbage collection (GC), is necessary to prevent capacity exhaustion of the storage-nodes. A storage-node in isolation cannot determine which local data-fragment versions are safe to garbage-collect, because write completeness is a property of a set of storage-nodes. A data-fragment version can be garbage-collected only if there exists a later complete write for the corresponding data-item. Storage-nodes can classify writes by executing the read protocol in the same manner as a client. However, no data need be returned for protocol members that do not tolerate Byzantine clients (since the cross checksum need not be validated). Linkage records provide sufﬁcient information for the storage-nodes to know which other nodes host relevant data-fragments.
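The GC safety rule above reduces to a simple predicate: a fragment version is collectible only if a strictly later complete write exists for its data-item. A minimal sketch, assuming the latest complete timestamp has already been found by running the read protocol's classification (names are illustrative):

```python
def collectible_versions(versions, latest_complete_ts):
    """Return the logical timestamps safe to garbage-collect.

    versions: dict mapping logical timestamp -> data-fragment version.
    latest_complete_ts: timestamp of the latest complete write for this
    data-item, or None if no complete write is known (keep everything)."""
    if latest_complete_ts is None:
        return []
    # Keep the latest complete version itself; collect strictly older ones.
    return [ts for ts in versions if ts < latest_complete_ts]
```

The interesting work is in computing `latest_complete_ts`, which requires the cross-node classification step; the pruning itself is purely local.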
Garbage collection is implemented in the current prototype and requires no additional RPCs. We have implemented a heuristic that invokes GC whenever idle time is detected. Periodically, a thread wakes up and checks the system load; if the load is low, GC is invoked. We use a timer-based idle-time detector, as described by Golding et al. [Golding et al. 1995]. This type of idle-time detector has been used successfully in cleaning heuristics for LFS, even in heavily loaded systems [Blackwell et al. 1995].
We also provide a method for invoking GC from outside the system (e.g., by a cron job).
It is usually inefficient to perform GC for every block, since most blocks do not have old versions that need to be reclaimed (e.g., in a system with a read-heavy workload). We address this by adding a counter to the PASIS per-block metadata to track the number of writes to a block. This counter is incremented during each write operation and reset after GC has run. If a block's write-count rises above a certain threshold, an entry identifying the block is added to an in-memory high-write-count table. When GC is executed, it first searches this table; if an entry is found, it is removed and GC is executed on that block. If no entries are found, GC can scan the block-space sequentially. Although this heuristic works, further research into policy issues, such as the appropriate frequency and order of garbage collection, is warranted.
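The write-count heuristic can be sketched as follows. The threshold value and all names here are illustrative, not taken from the PASIS prototype:

```python
class GCScheduler:
    """Sketch of the high-write-count GC heuristic described above."""

    def __init__(self, threshold=8):
        self.threshold = threshold
        self.write_counts = {}         # per-block write counter (per-block metadata)
        self.high_write_count = set()  # in-memory high-write-count table

    def on_write(self, block):
        # The counter is incremented during each write operation.
        self.write_counts[block] = self.write_counts.get(block, 0) + 1
        if self.write_counts[block] > self.threshold:
            self.high_write_count.add(block)

    def next_gc_target(self):
        # GC first searches the table; an entry found there is removed
        # and its block is garbage-collected next.
        if self.high_write_count:
            block = self.high_write_count.pop()
            self.write_counts[block] = 0   # counter reset after GC has run
            return block
        return None  # fall back to a sequential scan of the block-space
```

Returning `None` signals the sequential-scan fallback; a fuller implementation would also encode the scan cursor and GC frequency policy that the text flags as open questions.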
3.5.3 Client implementation
Our client implementation follows the pseudo-code described above. The client module is accessed through a set of library interface calls. These calls allow an application to control the encoding scheme, the threshold values, and the failure and timing models.
The client protocol routines are implemented such that different protocol family members and thresholds may be speciﬁed for different data-items. Likewise, the storage-nodes for any given data-item are also speciﬁed via these interfaces, thus externalizing control (and responsibility) for such bootstrapping information; for our experiments we use a static set of N storage-nodes. Clients communicate with storage-nodes through a TCP-based RPC interface.
In an asynchronous environment, the client implementation issues TIME REQUEST requests to only N + b − QC storage-nodes, since this ensures overlap with the latest complete write. To improve the responsiveness of write operations, clients return after the ﬁrst QC + b storage-nodes respond; the remainder of the requests complete in the background.
To improve the read operation's performance, only m read requests fetch the latest data pertaining to the data-fragment, while all receive version histories; this makes the read operation more network-efficient. The limited data-fragment version history returned by read requests allows clients to classify earlier writes without issuing additional storage-node requests. If necessary, after classification, extra data-fragments are fetched according to the candidate's timestamp. Once the data-item is successfully validated and decoded, it is returned.
3.5.4 Erasure codes
In our erasure coding implementation, if m = 1, replication is employed; otherwise an information dispersal algorithm [Rabin 1989] is used. Our information dispersal implementation stripes the data-item across the first m data-fragments (i.e., each data-fragment is 1/m of the original data-item's size). This makes the erasure code a systematic encoding: concatenation of the first m data-fragments produces the original data-item. These stripe-fragments are used to generate the code-fragments via polynomial interpolation within a Galois Field, which treats the stripe-fragments and code-fragments as points on a polynomial of degree m − 1. Our implementation of polynomial interpolation was originally based on [Dai 2003] (which conceptually follows [Rabin 1989]). We modified the source to use stripe-fragments and added an implementation of Galois Fields of size 2⁸ that uses lookup tables for multiplication.
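The table-based GF(2⁸) multiplication works by precomputing discrete logarithm and exponential tables for a generator of the field's multiplicative group, so that a · b becomes a table lookup of exp[log a + log b]. This is a sketch, not the PASIS code; the reduction polynomial 0x11b and generator 3 are one common choice (any irreducible degree-8 polynomial works for Rabin-style dispersal):

```python
def gf_mul_slow(a, b, poly=0x11b):
    """Bitwise carry-less multiply-and-reduce; used only to build tables."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def make_gf256_tables(poly=0x11b, generator=3):
    """Build exp/log lookup tables over GF(2^8).

    exp is doubled in length so log[a] + log[b] (at most 508) never
    needs an explicit mod-255 reduction."""
    exp = [0] * 512
    log = [0] * 256
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        x = gf_mul_slow(x, generator, poly)
    for i in range(255, 512):
        exp[i] = exp[i - 255]
    return exp, log

def gf_mul(a, b, exp, log):
    """Table-based multiplication: a * b = exp[log a + log b]."""
    if a == 0 or b == 0:
        return 0
    return exp[log[a] + log[b]]
```

The two-table scheme trades 768 bytes of lookup state for replacing the bit-serial loop with two loads and an add, which is what makes per-byte interpolation cheap.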
Beyond our base erasure code implementation, we implemented secret sharing [Shamir 1979] and short secret sharing [Krawczyk 1994]. Our implementation of short secret sharing closely follows [Krawczyk 1994], using AES for the cipher. Such erasure codes can also provide a degree of conﬁdentiality with regard to storage-nodes.
Our implementation of cross checksums closely follows Gong [Gong 1989]. Our implementation uses a publicly available implementation of MD5 [Rivest 1992] for all hashes. Each MD5 hash is 16 bytes long; thus, each cross checksum is N × 16 bytes long.
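A cross checksum of this form is simply the concatenation of the N per-fragment digests, which lets any party check a single fragment against its 16-byte slot. A minimal sketch (helper names are illustrative):

```python
import hashlib

def cross_checksum(fragments):
    """Concatenate the MD5 hash of each of the N data-fragments.

    The result is N x 16 bytes, matching the accounting in the text."""
    return b"".join(hashlib.md5(f).digest() for f in fragments)

def verify_fragment(checksum, index, fragment):
    """Check one fragment against its 16-byte slot in the cross checksum."""
    start = 16 * index
    return checksum[start:start + 16] == hashlib.md5(fragment).digest()
```

Note that MD5 is used here only to mirror the prototype's choice; a modern system would substitute a collision-resistant hash.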
3.6 Evaluation

This section evaluates protocol family performance in the context of the prototype block storage system.
3.6.1 Experimental setup
included in all experiments.
3.6.2 Performance and scalability of PASIS protocol members

PASIS configuration. Each storage-node is configured with 128 MB of data cache, and no caching is done on the clients. Storage-nodes use write-back caching, mimicking availability of 16 MB of non-volatile RAM. All experiments focus on the protocol costs: the working sets fit into memory and all caches are warmed up beforehand. Results from such experiments highlight the overheads introduced by the protocol and not those introduced by the disk system. It is, however, a full system implementation: each storage-node is backed by a real persistent data store, and compulsory cache flushes are serviced by the disk system.
Space-efﬁciency of protocol members
duces the network bandwidth needed, which reduces the response time of operations.
To perform a write operation, N data-fragments are sent over the network. With each data-fragment, a cross checksum and a linkage record are sent; these are, respectively, N times the size of an MD5 digest (16 bytes) and N times the size of a storage-node ID (4 bytes). Thus, the network bandwidth consumed by cross checksums and linkage records is 20 · N² bytes. RPC headers and arguments consume negligible bandwidth. Thus, the total amount of data sent over the network by a write operation is 16 KB × N + 20 B × N².
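This accounting can be written down directly; the function below is a sketch of the formula above (a 16 KB fragment payload per storage-node, plus the 20 · N bytes of checksum-and-linkage overhead attached to each of the N fragments):

```python
def write_bandwidth_bytes(N, fragment_bytes=16 * 1024):
    """Total bytes sent by one write operation, per the text's accounting.

    Per-fragment overhead is a cross checksum (16 * N bytes) plus a
    linkage record (4 * N bytes), sent with each of the N fragments,
    giving 20 * N**2 bytes of overhead in total."""
    return fragment_bytes * N + 20 * N ** 2
```

For the b = t = 1 configuration (N = 5), the quadratic overhead term is only 500 bytes against 80 KB of fragment data, which is why it is negligible at small N but worth tracking as N grows.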
Computation costs are incurred to erasure code data. Additional computation costs are incurred to authenticate messages and protect against non-crash failures. The majority of such computation costs are paid by clients in the system, rather than storage-nodes.
Figure 3.8: Computational cost of erasure coding.
Block size, N, and m dictate the computational cost of erasure coding.
Erasure coding costs. Figure 3.8 shows the trends in the cost of encoding data with our erasure code implementation. For comparison, the performance of N-fold replication (i.e., N memcpys) is shown. Lines are shown for ﬁxed m values of two and three. These lines illustrate that, as expected, the cost of an erasure code for a given m grows linearly with N, since the number of code-fragments grows with N.
Two other lines are shown in Figure 3.8 to illustrate the interesting impact of m on performance: the space-efficiency of an erasure code is inversely proportional to m, whereas the cost of generating some aggregate amount of code-fragments is proportional to m. Consider the m = N/2 line: for each point on the line, erasure coding generates, in total, 16 KB of code-fragments, although the number and individual sizes of the code-fragments differ.
When generating some aggregate amount of code-fragments, the cost of erasure coding grows linearly with m. For m = N − 1, a single code-fragment is needed for each write; as expected, the cost of generating this one fragment decreases with N, since the size of the fragment also decreases (to 16 KB/(N − 1)).
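The aggregate code-fragment sizes behind both curves follow from the systematic encoding: each of the N − m code-fragments is 1/m of the block size. A one-line sketch makes the two special cases above checkable:

```python
def total_code_fragment_bytes(block_bytes, N, m):
    """Aggregate size of the N - m code-fragments for a systematic
    m-of-N encoding: each fragment is block_bytes / m."""
    return (N - m) * block_bytes / m
```

With m = N/2 the total is always exactly one block (16 KB here) regardless of N, while with m = N − 1 the single code-fragment shrinks as N grows, matching the trends described for Figure 3.8.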
Table 3.2: Client and storage-node computation costs.
Costs are broken down for the asynchronous repairable protocol member with Byzantine storage-nodes for: b = t = 1 and b = t = 4 (N = 5, m = 2 and N = 17, m = 5, respectively).
clients in the system. Erasure coding is done by the client and requires nothing of the storage-node. The difference in computation costs for the two instances of the protocol member listed is due to their respective values of N and m. The cost of erasure coding with regard to N and m is discussed above. The cost of generating cross checksums grows with N/m.
Read operations in protocol members with only crash clients are computationally less demanding than write operations. A read operation requires fewer hashes of data-fragments and the generation of fewer code-fragments. In the best case, the m stripe-fragments can be concatenated and no code-fragments need be generated. In protocol members that tolerate Byzantine clients, read operations perform almost the same computation as write operations to validate the cross checksum (i.e., N − m code-fragments are generated and N data-fragment hashes are taken).
Short secret sharing can be used in place of our default erasure code. Doing so adds ≈550 µs to the base erasure code costs for encrypting the data-item under the AES cipher, and less than 20 µs for generating and secret-sharing the encryption key (this cost depends on m and N). Both write and read operations incur these costs.