FREE ELECTRONIC LIBRARY - Abstracts, online materials

Pages:     | 1 |   ...   | 4 | 5 || 7 | 8 |   ...   | 18 |

«Efficient, scalable consistency for highly fault-tolerant storage GARTH GOODSON August 2004 CMU-CS-04-111 Dept. of Electrical and Computer ...»

-- [ Page 6 ] --

It should be noted that, even after a write completes, it may be classified as repairable by a subsequent read, but it will never be classified as incomplete. For example, this could occur if the read set (of N − t storage-nodes) does not fully encompass the write set (of N − t storage-nodes).

3.4 Constraints

–  –  –

To classify a candidate as INCOMPLETE a client must determine that a complete write does not exist in the system (i.e., fewer than QC benign storage-nodes host the write). For this to be the case, the client must have queried all possible storage-nodes (N − t), and must assume that nodes not queried host the candidate in consideration. So,

–  –  –

3.4.2 Real repairable candidates To ensure that Byzantine storage-nodes cannot fabricate a repairable candidate, a candidate set of size b must be classifiable as incomplete. Substituting b into (3.2),

–  –  –

3.4.3 Decodable repairable candidates Any repairable candidate must be decodable. The lower bound on candidate sets that are repairable follows from (3.2) (since the upper bound on classifying a candidate as

incomplete coincides with the lower bound on repairable):

–  –  –

Since slow storage-nodes cannot be differentiated from crashed storage-nodes, only N −t responses can be awaited. As well, b responses received may be from Byzantine storagenodes.

3.4.5 Constraint summary The summary of constraints is given in Table 3.1. The bounds on N (i.e., N 2t + 2b) have been shown to be optimal for systems with single round-trip write operations [Abraham et al. 2004]. A diagram that more intuitively shows the constraint on N is shown in Figure 3.7.

3.5 Implementation PASIS consists of clients and storage-nodes. Storage-nodes store data-fragments and their versions. Clients execute the protocol to read and write data-items.

3.5.1 Storage-node implementation

–  –  –

Figure 3.7: Illustration of constraint on N.

This figure shows the intuitive reasoning behind the constraint on N = 2t + 2b + 1. As shown, a write executes at any N − t storage-nodes, and a read executes at a set of N − t storage-nodes that has the minimum overlap with the write. The read only observes the data value on the storage-nodes within the intersection of the read and write.

Since b of these storage-nodes may be Byzantine b + 1 matching values must be observed. This leads to an intersection of size 2b + 1. Thus, t + (2b + 1) + t = N = 2t + 2b + 1.

ganization to reduce the cost of data versioning. Experience indicates that retaining every version and performing local garbage collection comes with minimal performance cost (a few percent) and that it is feasible to retain complete version histories for several days [Soules et al. 2003; Strunk et al. 2000].

We extended CVFS to provide an interface for retrieving the logical timestamp of a data-fragment. Implicitly, each write request creates a new version of the data-fragment (indexed by its logical timestamp) at the storage-node. In addition to data, each write request contains a cross checksum, a logical timestamp, and a linkage record [Amiri et al.

1999]. The linkage record consists of descriptions of the encoding scheme, and addresses of the N storage-nodes for a specific data-item; it is fixed upon data-item creation.

By default, a read request returns the most current data-fragment version, ordered by logical timestamp. Read responses may also contain a limited version history containing logical timestamps of previously executed write requests. The version history allows clients to identify and classify additional candidates without issuing extra read requests.

Storage-node can also return read responses that contain no data other than version histories, which makes candidate classification more network-efficient.

· 40 Efficient, scalable consistency for highly fault-tolerant storage

3.5.2 Garbage collection

Pruning old versions, or garbage collection (GC), is necessary to prevent capacity exhaustion of the storage-nodes. A storage-node in isolation cannot determine which local data-fragment versions are safe to garbage-collect, because write completeness is a property of a set of storage-nodes. A data-fragment version can be garbage-collected only if there exists a later complete write for the corresponding data-item. Storage-nodes can classify writes by executing the read protocol in the same manner as a client. However, no data need be returned for protocol members that do not tolerate Byzantine clients (since the cross checksum need not be validated). Linkage records provide sufficient information for the storage-nodes to know which other nodes host relevant data-fragments.

Garbage collection is implemented in the current prototype and it requires no additional RPCs. We have implemented a heuristic to invoke GC whenever idle time is detected. Periodically, a thread wakes up and checks the system load. If the system load is low, then GC will be invoked. We use a timer-based idle-time detector, as described by Golding et al.[Golding et al. 1995]. This type of idle time detector was successfully used in cleaning heuristics for LFS, even in heavily loaded systems [Blackwell et al. 1995].

We also have a method for invoking GC externally of the system (e.g., by a CRON job).

It is usually inefficient to perform GC for every block, since most blocks do not have old versions that need to be reclaimed (e.g., in a system with a read-heavy workload). We address this by adding a counter to the PASIS per-block metadata to track the number of writes to a block. This counter is incremented during each write operation and reset after GC has run. If the the block’s write-count rises above certain threshold, an entry identifying the block is added to an in-memory high-write-count table. When GC is executed, it first searches this table. If an entry is found, it is removed and GC is executed on that block. If no entries are found, GC can scan the block-space sequentially. Although this heuristic works, further research into policy issues, such as the appropriate frequency and order of garbage collection, is warranted.


3.5 Implementation 41

3.5.3 Client implementation

Our client implementation follows the pseudo-code described above. The client module is accessed through a set of library interface calls. These calls allow an application to control the encoding scheme, the threshold values, and the failure and timing models.

The client protocol routines are implemented such that different protocol family members and thresholds may be specified for different data-items. Likewise, the storage-nodes for any given data-item are also specified via these interfaces, thus externalizing control (and responsibility) for such bootstrapping information; for our experiments we use a static set of N storage-nodes. Clients communicate with storage-nodes through a TCP-based RPC interface.

In an asynchronous environment, the client implementation issues TIME REQUEST requests to only N + b − QC storage-nodes, since this ensures overlap with the latest complete write. To improve the responsiveness of write operations, clients return after the first QC + b storage-nodes respond; the remainder of the requests complete in the background.

To improve the read operation’s performance, only m read requests fetch the latest data pertaining to the data-fragment, while all receive version histories; this makes the read operation more network-efficient. The limited data-fragment version history returned by read requests, allows clients to classify earlier writes without issuing additional storage-node requests. If necessary, after classification, extra data-fragments are fetched according to the candidate’s timestamp. Once the data-item is successfully validated and decoded, it is returned.

3.5.4 Erasure codes

In our erasure coding implementation, if m = 1, then replication is employed, otherwise an information dispersal algorithm [Rabin 1989] is used. Our information dispersal implementation stripes the data-item across the first m data-fragments (i.e., each data-fragment is of the original data-item’s size). This makes the erasure code a systematic encoding;

m Thus, concatenation of the first m data-fragments produce the original data-item. These · 42 Efficient, scalable consistency for highly fault-tolerant storage stripe-fragments are used to generate the code-fragments via polynomial interpolation within a Galois Field, which treats the stripe-fragments and code-fragments as points on some m − 1 degree polynomial. Our implementation of polynomial interpolation was originally based on [Dai 2003] (which conceptually follows [Rabin 1989]). We modified the source to use stripe-fragments and added an implementation of Galois Fields of size 28 that use lookup tables for multiplication.

Beyond our base erasure code implementation, we implemented secret sharing [Shamir 1979] and short secret sharing [Krawczyk 1994]. Our implementation of short secret sharing closely follows [Krawczyk 1994], using AES for the cipher. Such erasure codes can also provide a degree of confidentiality with regard to storage-nodes.

Our implementation of cross checksums closely follows Gong [Gong 1989]. Our implementation uses a publicly available implementation of MD5 [Rivest 1992] for all hashes. Each MD5 hash is 16 bytes long; thus, each cross checksum is N × 16 bytes long.

3.6 Evaluation This section evaluates protocol family performance in the context of the prototype block storage system.

3.6.1 Experimental setup

–  –  –

included in all experiments.

3.6.2 Performance and scalability of PASIS protocol members PASIS configuration Each storage-node is configured with 128 MB of data cache, and no caching is done on the clients. Storage-nodes use write-back caching, mimicking availability of 16 MB of non-volatile RAM. All experiments focus on the protocol costs: the working sets fit into memory and all caches are warmed up beforehand. Results from such experiments highlight the overheads introduced by the protocol and not those introduced by the disk system. It is, however, a full system implementation: each storage-node is backed by a real persistent data store, and compulsory cache flushes are serviced by the disk system.

Space-efficiency of protocol members

–  –  –

duces the network bandwidth needed, which reduces the response time of operations.

To perform a write operation, N data-fragments are sent over the network. With each data-fragment, a cross checksum and linkage record are sent. Respectively, these are N times the size of a MD5 digest (16 bytes) and N times the size of a storage-node ID (4 bytes). Thus, the network bandwidth consumed by cross checksums is 20 · N 2 bytes. RPC headers and arguments consume negligible bandwidth. Thus, the total amount of data sent over the network by a write operation is: 16 KB × N + 20 B × N 2.


Computation costs

Computation costs are incurred to erasure code data. Additional computation costs are incurred to authenticate messages and protect against non-crash failures. The majority of such computation costs are paid by clients in the system, rather than storage-nodes.

· 44 Efficient, scalable consistency for highly fault-tolerant storage

–  –  –

Figure 3.8: Computational cost of erasure coding.

Block size, N, and m dictate the computational cost of erasure coding.

Erasure coding costs. Figure 3.8 shows the trends in the cost of encoding data with our erasure code implementation. For comparison, the performance of N-fold replication (i.e., N memcpys) is shown. Lines are shown for fixed m values of two and three. These lines illustrate that, as expected, the cost of an erasure code for a given m grows linearly with N, since the number of code-fragments grows with N.

Two other lines are shown in Figure 3.8 to illustrate the interesting impact of m on performance: the space-efficiency of an erasure code is inversely proportional to m whereas the cost of generating some aggregate amount of code-fragment is proportional to m. ConN sider the m = line. For each point on the line, erasure coding generates, in total, 16 KB of code-fragments, although the number and individual sizes of the code-fragments differ.

When generating some aggregate amount of code-fragments, the cost of erasure coding grows linearly with m. For m = N − 1, a single code-fragment is needed for each write;

as expected, the cost of generating one fragment decreases with N, since the size of the fragment also decreases (to N−1 ).

–  –  –

Table 3.2: Client and storage-node computation costs.

Costs are broken down for the asynchronous repairable protocol member with Byzantine storage-nodes for: b = t = 1 and b = t = 4 (N = 5, m = 2 and N = 17, m = 5, respectively).

clients in the system. Erasure-coding is done by the client and requires nothing of the storage-node. The difference in computation costs for the two instances of the protocol member listed is due to their respective values of N and m. The cost of erasure coding with regard to N and m is discussed above. The cost of generating cross checksums grows N with m.

Read operations in protocol members with only crash clients are computationally less demanding than write operations. A read operation requires fewer hashes of datafragments and generation of fewer code-fragments. In the best case, the m stripe-fragments can be concatenated and no code-fragments need be generated. In protocol members that tolerate Byzantine clients, read operations performs almost the same computation as write operations to validate the cross checksum (i.e., N − m code-fragments are generated and N data-fragments hashes are taken).

Short secret sharing can be used in place of our default erasure code. Doing so adds · 46 Efficient, scalable consistency for highly fault-tolerant storage ≈550 µs to the base erasure code costs for encrypting the data-item under the AES cipher and less than 20 µs for generating and secret sharing the encryption key (this cost depends on m and N). Both write and read operations incur these costs.

Pages:     | 1 |   ...   | 4 | 5 || 7 | 8 |   ...   | 18 |

Similar works:

«University of Iowa Iowa Research Online Theses and Dissertations Spring 2009 Acute neural adaptations to resistance training performed with low and high rates of muscle activation Clayton Robert Peterson University of Iowa Copyright 2009 Clayton Robert Peterson This dissertation is available at Iowa Research Online: http://ir.uiowa.edu/etd/257 Recommended Citation Peterson, Clayton Robert. Acute neural adaptations to resistance training performed with low and high rates of muscle activation....»

«LE LIT DE PROCUSTE DU MÊME AUTEUR AUX ÉDITIONS LES BELLES LETTRES Le Hasard sauvage. Comment la chance nous trompe, 2005. Le Cygne noir. La puissance de l’imprévisible, 2008. Le Cygne noir. La puissance de l’imprévisible, édition de poche, 2010. Force et fragilité. Réflexions philosophiques et empiriques, 2010. NASSIM NICHOLAS TALEB LE LIT DE PROCUSTE Aphorismes traduits par l’auteur avec la collaboration de Laure de Chantal et Alexis Grégoire Sainte Marie. PARIS LES BELLES...»

«XXIII World Congress of Philosophy, Athens 2013 Comparative and intercultural philosophy The Harmonization of Opposites Yannis Fikas, Ph.D. Director of Recruitment Development, New York College, Athens Email: yfikas@nyc.gr http://www.nyc.gr/ The theory of the harmonization of opposites as reflected in Western and Eastern philosophy is presented in this paper. Initially, the theory of the harmonization of opposites is found in the philosophical views of the Presocratic philosophers, Plato, the...»

«From Pariah to Parrhesiastes: Reconceptualising the Whistleblower in a Complex World by Julio Anthony Andrade Thesis presented in partial fulfilment of the requirements for the degree Master of Philosophy (Applied Ethics) at the University of Stellenbosch Supervisor: Dr Minka Woermann Faculty of Arts and Social Sciences Department of Philosophy December 2011 Stellenbosch University http://scholar.sun.ac.za Declaration By submitting this thesis/dissertation electronically, I declare that the...»

«Know Thyself and Your Followers Leadership Advance Online– Issue XVII, Summer 2009 by J. Alan Marshall and Corné J. Bekker Whether credited to the Oracle of Delphi or to Greek philosophers such as Plato et al., the importance of knowing oneself has been emphasized over the ages (Classics, 2008). Mirvis (1982) applied this axiom to research endeavors by proposing that in order to have any hope of meaningful and ethical research, researchers must adhere to this admonition with the variation,...»

«Boston University OpenBU http://open.bu.edu Theses & Dissertations Boston University Theses & Dissertations 2015 Enzinas to Valera: motives, methods and sources in sixteenth-century Spanish Bible translation Hasbrouck, Peter http://hdl.handle.net/2144/15178 Boston University BOSTON UNIVERSITY GRADUATE SCHOOL OF ARTS AND SCIENCES Dissertation ENZINAS TO VALERA: MOTIVES, METHODS AND SOURCES IN SIXTEENTH-CENTURY SPANISH BIBLE TRANSLATION by PETER W. HASBROUCK B.A., Moody Bible Institute, 1992...»

«WHAT DOES HOLISM HAVE TO DO WITH MORAL PARTICULARISM? Sean McKeever and Michael Ridge Abstract Moral particularists are united in their opposition to the codification of morality, and their work poses an important challenge to traditional ways of thinking about moral philosophy. Defenders of moral particularism have, with near unanimity, sought support from a doctrine they call “holism in the theory of reasons.” We argue that this is all a mistake. There are two ways in which holism in the...»

«This is a preprint of an article whose final and definitive form will be published in the Australasian Journal of Philosophy 93(2): 389-394. Please cite the published version. Non-Agential Permissibility in Epistemology∗ Luis R. G. Oliveira University of Massachusetts, Amherst January 2015 Paul Silva has recently argued that doxastic justification does not have a basing requirement. An important part of his argument depends on the assumption that doxastic and moral permissibility have a...»

«BIOCOSMOLOGY – NEO-ARISTOTELISM Abbreviated key title: Biocosmol. – neo-Aristot. Parallel title: Биокосмология – нео-Аристотелизм Bilingual Electronic Journal of Universalizing Scientific and Philosophical Research based upon the Original Aristotelian Cosmological Organicism ISSN: 2225-1820 Volume 6, Number 2, Spring 2016 Official organ of the Biocosmological Association – http://en.biocosmology.ru/ Place and time of origination: at the Novgorod State...»

«Multivalent Documents: Anytime, Anywhere, Any Type, Every Way User-Improvable Digital Documents and Systems by Thomas Arthur Phelps B.S. (University of Illinois at Urbana-Champaign) 1990 A.B. (University of Illinois at Urbana-Champaign) 1990 M.S. (University of California, Berkeley) 1993 A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the GRADUATE DIVISION of the UNIVERSITY of CALIFORNIA, BERKELEY Committee in...»

«2014-2015 GradLink Mentoring Programme School of Social Sciences and Philosophy Connecting Graduates & Students Graduate Mentor Profiles p6 p4 BANKING & FINANCIAL SERVICES p7 p5 BUSINESS p8 MANAGEMENT CONSULTING TECHNOLOGY MARKETING JOURNALISM EDUCATION p9 VOLUNTARY/COMMUNITY p10 PUBLIC SERVICE p10 SALES p10 HEALTHCARE p5 Programme Launch Order of Events pm Students and Mentors arrive 7:30pm Welcome 7:40pm Mentors share their career experiences 8:20-9:15pm Mentors and students meet informally...»

«VERNACULAR SONG FROM A NORTH YORKSHIRE HILL FARM: CULTURE, CONTEXTS AND COMPARISONS Two volumes Volume I Submitted to the University of Newcastle for the degree of Doctor of Philosophy DAVID HILLERY International Centre for Music Studies Volume I CONTENTS Illustrations v Figures v Maps vi Tables vi Acknowledgements vii Abstract viii Preface ix Introduction xvi Thesis structure xvi Problems in transcription xviii Transcription in this thesis xxi Technical data xxii Terminology xxiii...»

<<  HOME   |    CONTACTS
2017 www.abstract.dislib.info - Abstracts, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.