FREE ELECTRONIC LIBRARY - Abstracts, online materials

Pages:     | 1 |   ...   | 2 | 3 || 5 | 6 |   ...   | 18 |

«Efficient, scalable consistency for highly fault-tolerant storage GARTH GOODSON August 2004 CMU-CS-04-111 Dept. of Electrical and Computer ...»

-- [ Page 4 ] --

This section describes work related to building scalable metadata services. Numerous previous systems have focused on horizontally scaling data services through the addition of storage-nodes to obtain high data throughput. However, most systems either utilize a centralized metadata service or partition metadata across a set of servers such that each piece of metadata is handled by a single metadata server. The former approach is limited in its ability to scale, and both approaches render a metadata operation susceptible to a fault or compromise of the server responsible for it.

For example, NASD [Gibson et al. 1998] and Swift [Cabrera and Long 1991] centralize access to a metadata server. IBM’s Storage Tank [Menon et al. 2003] and Lustre [Braam 2004] replace the central metadata server with a cluster of servers, partitioning metadata across the servers while supporting server fail-over. Likewise, some systems partition certain metadata structures (e.g., the manager map in xFS [Anderson et al. 1996] and the lock table in Frangipani [Thekkath et al. 1997]). Other systems make use of distributed protocols that communicate among the metadata servers to provide a replicated, fault-tolerant metadata service (e.g., Paxos [Lamport 1998] in Frangipani and BFT [Castro and Liskov 2002] in Farsite [Adya et al. 2002]) and OceanStore [Kubiatowicz et al.

2000]. Lastly, in some systems the storage-devices export interfaces directly to the client that provide serialized access to the device (e.g., device-served locks in GFS [Soltis et al.

1996] and base storage transactions by Amiri et al. [Amiri et al. 2000b]).

Survivable file systems have typically focused on the use of Byzantine fault-tolerant replication to protect the metadata service (e.g., [Deswarte et al. 1991]). Modern examples such as Farsite [Adya et al. 2002], OceanStore [Kubiatowicz et al. 2000], and BFS [Castro and Liskov 2002] employ state machine replication [Schneider 1990] for this purpose.

While a powerful paradigm, state machine replication suffers from fundamental scaling limitations. First, all service nodes process all requests, so update throughput generally does not improve with additional nodes. Second, since the message complexity in their underlying agreement protocols is θ(n2 ) with n replicas, the effect of adding nodes can be ·

2.3 Consistency semantics 21 to degrade metadata update throughput. As such, adding replicas to the service group is of limited value: the throughput of read-only operations may improve, but the throughput of update operations at best would remain constant.

Consequently, to allow the metadata service in, e.g., Farsite to scale, the file system name-space is partitioned across multiple metadata services [Adya et al. 2002]. However, partitioning the name-space introduces another difficulty, namely implementing metadata operations atomically across replica groups, particularly in a manner resilient to Byzantine servers and clients. We are aware of no metadata service implementation that achieves this.

Our protocols employ a different paradigm that permits better load-balancing of requests across servers and linear-or-better message complexity per client request, and thus better ability to scale throughput as new servers are added. Rather than partitioning the name-space, we implement all metadata operations with a single replica group, and scale via lighter-weight access protocols than those implementing state machine replication. In the spirit of quorum protocols [Malkhi et al. 2000; Naor and Wool 1998], our approach permits clients to involve only a subset of servers in each operation (with no server-toserver communication). In particular, each read or update operation need only execute on a subset of metadata-nodes. Since all metadata operations are served in the same replica group, our approach can implement any metadata operation atomically. Thus, our metadata objects are, in effect, replicated state machines.

Extending our conditional write protocol to send update operations and to receive operation results, rather than sending and receiving whole objects, is efficient for objects with large state (e.g., directory objects). The optimistic nature of the conditional write protocol distinguishes it from other Byzantine quorum protocols. However, the protocol does not achieve the lower bound on N for implementing a Byzantine-tolerant replicated state-machine (i.e., N ≥ 4b + 1 [Malkhi and Reiter 1998a]).

The protocols developed in this thesis are most closely related to threshold-quorum systems (i.e., a majority voting system [Gifford 1979; Thomas 1979]), though our approach offers opportunities for exploring use of richer quorum constructions (e.g., [Malkhi · 22 Efficient, scalable consistency for highly fault-tolerant storage and Reiter 1998a; Malkhi et al. 2000]). In a threshold-quorum system, the load [Naor and Wool 1998; Malkhi et al. 2000] on each storage-node is at least one half. This means that each storage-node must execute requests for at least one half of the operations applied to objects it hosts. True Byzantine quorum systems [Malkhi and Reiter 1998a] scale better than the one half bound. If Byzantine quorum construction techniques such as the M-Path construction [Malkhi et al. 2000] are employed, then the lower bound on load is Ω( b N ).

3 Read/Write Block Protocol This chapter describes and evaluates a new consistency protocol that operates in an asynchronous environment and tolerates Byzantine failures of clients and storage-nodes. The protocol supports a hybrid failure model in which up to t storage-nodes may fail: b ≤ t of these failures can be Byzantine and the remainder can be crash. The protocol also supports use of m-of-n erasure codes (i.e., m-of-n fragments are needed to reconstruct the data), which usually require less network bandwidth (and storage space) than full replication [Weatherspoon and Kubiatowicz 2002; Wylie et al. 2000].

Briefly, the protocol works as follows. To perform a write, a client determines the current logical time (by querying a subset of the storage-nodes) and then writes timestamped fragments to at least a threshold quorum of storage-nodes. Storage-nodes keep all versions of fragments they are sent until garbage collection frees them. To perform a read, a client fetches the latest fragment versions from a threshold quorum of storagenodes and determines whether they comprise a completed write; usually, they do. If they do not, additional and historical fragments are fetched, and repair may be performed, until a completed write is observed.

The protocol gains efficiency from five features. First, the space-efficiency of m-of-n erasure codes can be substantial, reducing communication overheads significantly. Second, most read operations complete in a single round trip: reads that observe write concurrency or failures (of storage-nodes or a client write) may incur additional work. Most studies of distributed storage systems (e.g., [Baker et al. 1991; Noble and Satyanarayanan 1994]) indicate that concurrency is uncommon (i.e., writer-writer and writer-reader shar· 24 Efficient, scalable consistency for highly fault-tolerant storage ing occurs in well under 1% of operations). Failures, although tolerated, ought to be rare.

Third, incomplete writes are replaced by subsequent writes or reads (that perform repair), thus preventing future reads from incurring any additional cost; when subsequent writes do the fixing, additional overheads are never incurred. Fourth, most protocol processing is performed by clients, increasing scalability via the well-known principle of shifting work from servers to clients [Howard et al. 1988]. Fifth, the protocol only requires the use of cryptographic hashes, rather than more expensive cryptographic primitives (e.g., digital signatures).

This chapter describes the protocol in detail, develops bounds for thresholds in terms of the number of failures tolerated (i.e., the protocol requires at least 2t + 2b + 1 storagenodes), and provides a proof sketch of its safety and liveness. The protocol requires at least 2t + 2b + 1 storage-nodes (i.e., 4b + 1 if t = b). It also describes and evaluates its use in a prototype storage system called PASIS [Wylie et al. 2000]. To demonstrate that our protocol is efficient in practice, we compare its performance to BFT [Castro and Liskov 2001; 2002], the Byzantine fault-tolerant replicated state machine implementation that Castro and Liskov have made available [Castro and Rodrigues 2003]. Experiments show that PASIS scales better than BFT in terms of network utilization at the server and in terms of work performed by the server. Experiments also show that response times of PASIS and BFT are comparable.

3.1 System model

–  –  –

tine clients). We assume that Byzantine clients and storage-nodes are computationally bounded so that we can benefit from cryptographic primitives (i.e., cryptographic hash functions).

The protocol is developed with a hybrid storage-node failure model [Thambidurai and Park 1988]. Under a traditional hybrid failure model, up to t storage-nodes could fail, b ≤ t of which may be Byzantine faults; the remainder could only crash. However, we consider a hybrid failure model for storage-nodes that crash and recover. The crashrecovery failure model is a strict generalization of the omission and crash failure models.

First, we review the crash-recovery model from Aguilera et al. [Aguilera et al. 2000].

In a system of n processes, each process can be classified as always-up, eventually-up, eventually-down, or unstable. A process that is always-up never crashes. A process that is eventually-up crashes at least once, but there is a time after which it is permanently up.

A process that is eventually-down crashes at least once, and there is a time after which it is permanently down. A process that is unstable crashes and recovers infinitely many times. These classifications are further refined: a process is good if it is either always-up or eventually-up.

We combine the crash-recovery model with the hybrid failure model as follows. Up to b storage-nodes may ever be Byzantine; such storage-nodes do not recover and are not good. There are at least N − t good storage-nodes (where b ≤ t). A storage-node that is not Byzantine is said to be benign (i.e., benign storage-nodes are either always-up, eventually-up, eventually-down, or unstable). We assume that storage-nodes have stable storage that persists throughout the crash and recovery process.

The protocol tolerates crash and Byzantine clients. As in any practical storage system, an authorized Byzantine client can write arbitrary values to storage. These writes only affect the value of the data, but do not compromise the safety (linearizability) of the object. A client that does not exhibit a Byzantine failure (it is either correct or crashes) is benign We assume an asynchronous model of time (i.e., we make no assumptions about message transmission delays or the execution rates of clients and storage-nodes, except that it · 26 Efficient, scalable consistency for highly fault-tolerant storage is non-zero). We assume point-to-point authenticated channels with properties similar to those used by Aguilera et al. [Aguilera et al. 2000]. In summary, channels do not create messages (no creation), channels may experience finite duplication, and channels are fair loss. The finite duplication property ensures that if benign process p sends a message to benign process q only a finite number of times, then q receives the message only a finite number of times. The fair loss property ensures that if benign process p sends infinitely many messages to good process q, then q receives infinitely many messages from p.

There are two types of operations in the protocol — read operations and write operations — both of which operate on data-items. Clients perform read/write operations that issue multiple read/write requests to storage-nodes. A read/write request operates on a data-fragment. A data-item is encoded into data-fragments. Clients may encode data-items in an erasure-tolerant manner; thus the distinction between data-item and datafragment. Requests are executed by storage-nodes; a correct storage-node that executes a write request hosts that write operation.

Clients may encode data-items in an erasure-tolerant manner; thus the distinction between data-item and data-fragment. We focus here on threshold erasure codes in which any m of the n encoded data-fragments can decode the data-item. When m = 1, the replication is used. Examples of such codes are replication, Reed-Solomon codes [Berlekamp 1968], secret sharing [Shamir 1979], RAID 3/4/5/6 [Patterson et al. 1988], information dispersal (IDA) [Rabin 1989], short secret sharing [Krawczyk 1994], and “tornado” or LDPC codes [Luby et al. 2001].

Storage-nodes provide fine-grained versioning; correct storage-nodes host a version of the data-fragment for each write request they execute. There is a well known zero time, 0, and null value, ⊥, which storage-nodes can return in response to read requests.

Implicitly, all stored data is initialized to ⊥ at time 0.


3.2 Mechanisms 27 Figure 3.1: Example of cross checksum generation for 5 data-fragments. To generate a cross checksum, a cryptographic hash is taken of each data-fragment. These hashes are then concatenated, replicated, and stored with each data-fragment.

3.2 Mechanisms This section describes mechanisms employed for encoding data, preventing Byzantine clients and storage-nodes from violating consistency, and authenticating client and storagenode requests. We assume that storage-nodes and clients are computationally bounded such that cryptographic primitives can be effective.

3.2.1 Erasure codes

–  –  –

erasure codes (e.g., Secret Sharing [Shamir 1979] and Short Secret Sharing [Krawczyk 1994]) work as well.

Pages:     | 1 |   ...   | 2 | 3 || 5 | 6 |   ...   | 18 |

Similar works:


«Rome: Ancient by Eugene Rice Encyclopedia Copyright © 2015, glbtq, Inc. Entry Copyright © 2004, glbtq, inc. Reprinted from http://www.glbtq.com When Romans write about love, they normally are even-handed: For Ovid (43 B.C.E.-17 C.E.) love can be inspired by a boy or a girl (aut puer aut. puella) and Lucretius (94-55 B.C.E.) sees no difference whether boy or women (sive puer. seu mulier). Musonius Rufus (ca 30-102 C.E.) taught Stoic moral philosophy at Rome during the reigns of Nero and...»

«Pensive Pondering Posts, Jason Brownlee Pensive Pondering Posts This document contains all blog posts from my former blog ‘Pensive Pondering’ that was online from August 2005 to approximately September 2006 at the URL http://pensive-pondering.blogspot.com. This document was created from the first version of posts emailed to myself via the post notification feature of blogger during that time. Posts are left in their original form without changes to spelling, gramma, or formatting. Comments...»

«Original citation: Ansell-Pearson, Keith (2015) Beyond selfishness : epicurean ethics in Nietzsche and Guyau. In: Bamford, Rebecca, (ed.) Nietzsche’s Free Spirit Philosophy. Lanham : Rowman & Littlefield International, pp. 49-69. ISBN 9781783482191 Permanent WRAP URL: http://wrap.warwick.ac.uk/78574 Copyright and reuse: The Warwick Research Archive Portal (WRAP) makes this work by researchers of the University of Warwick available open access under the following conditions. Copyright © and...»

«General and Specific Avoidance Coping: The Development and Validation of a New Scale. A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy by Leendert Johannes (Lehan) Stemmet University of Canterbury College of Science “In our privileged lives, we are uniquely smart enough to have invented these stressors and uniquely foolish enough to have let them, too often, dominate our lives. Surely we have the potential to be uniquely wise enough to banish...»

«PROPOSITIONAL LOGIC OF SUPPOSITION AND ASSERTION1 1. LANGUAGE AND SPEECH ACTS In this paper I develop a system of what I understand to be illocutionary logic. In order to motivate this system and make it intelligible to the reader, I briefly sketch a philosophical basis for the system. This sketchy foundation will not be argued for, because there is no room for that in the present paper. However, it should be clear how this foundation “gives rise to” the logical system that is developed. On...»

«Proceedings of the first International Conference on Systems Thinking in Management, Geelong, Australia, 2000, pp 532-537. Complexity Science: A ‘Grey’ Science for the ‘Stuff in Between’ Kurt A. Richardson1, Paul Cilliers2, and Michael Lissack3 Institute for the Study of Coherence and Emergence, Boston, USA E-mail: kurt@kur trichardson.com Department of Philosophy, University of Stellenbosch, Stellenbosch 7600, South Africa E-mail: fpc@akad.sun.ac.za Institute for the Study of...»

«Four Sermons in Defiance of the Nazis Preached During 1941 by Bischop von Galen of Münster Clemens August, Count von Galen Clemens August, Count von Galen, was born on 16th March 1878 in Burg Dinklage in Oldenburg. As the eleventh of thirteen children he grew up in the safeness of a deeply religious family. He attended the secondary school of the Jesuits in Feldkirch and obtained his leaving certificate at Vechta in 1896. After studying philosophy at Fribourg (Switzerland) for a short period,...»

«Lurking in email-based discussion lists Robert Blair Nonnecke A thesis submitted in partial fulfillment of the requirements of South Bank University for the degree of Doctor of Philosophy March 2000 © R. B. Nonnecke Acknowledgments I would like to extend my gratitude to a number of people. Any project of this magnitude requires a wide range of talents. I am indebted to Thawatchai Piyawat and Dick Seabrook for their consummate programming skills, and to Jantawan Noiwan for her statistical...»

«Travels to terra incognita: The Scottish Highlands and Hebrides in early modern travellers’ accounts c. 1600 to 1800 Dissertation zur Erlangung des Doktorgrades der Philosophischen Fakultät der Christian-Albrechts-Universität zu Kiel vorgelegt von Martin Rackwitz Kiel Band I Erstgutachter: Prof. Dr. Thomas Riis Zweitgutachter: Prof. Dr. Allan I. Macinnes Tag der mündlichen Prüfung: 08.12.2004 Durch den zweiten Prodekan, Prof. Dr. Norbert Nübler, zum Druck genehmigt am: 14.02.2005 ‘Wenn...»

«Brown / PORTFOLIOS AND ADULT EDUCATION QUARTERLY / May 2002 ADULT LEARNING KNOW THYSELF: THE IMPACT OF PORTFOLIO DEVELOPMENT ON ADULT LEARNING JUDITH O. BROWN Barry University Thousands of years ago the Greeks carved above their temples the phrase “know thyself,” two simple words that imply a lifetime of investigation. Throughout the ages philosophers and scholars have emphasized the importance of self-knowledge as an outcome of learning. One teaching strategy that facilitates...»

«Kierkegaard & Nietzsche: Two Different Passions Nothing fascinates man more than man himself. Of all the problems he has had to face, none have been more elusive and hard to solve than the problem of what it means to be human. It seems that no two philosophers have the same answer. It would not be hard to believe that no answer exists, but that does not stop men from seeking one. The philosophy that seems best suited for the task is existentialism, which primarily concerns itself with the...»

<<  HOME   |    CONTACTS
2017 www.abstract.dislib.info - Abstracts, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.