FREE ELECTRONIC LIBRARY - Abstracts, online materials

Pages:     | 1 |   ...   | 5 | 6 || 8 | 9 |   ...   | 18 |

«Efficient, scalable consistency for highly fault-tolerant storage GARTH GOODSON August 2004 CMU-CS-04-111 Dept. of Electrical and Computer ...»

-- [ Page 7 ] --

–  –  –

in creating the cross checksum. A hash of the cross checksum is taken to verify the hash within the timestamp. Read requests require no significant computation by the storagenode (for the protocol).

AUTHENTICATION: Clients and storage-nodes must authenticate each RPC request and response. Authentication is performed over the RPC header and some RPC arguments.

Cross checksums and data-fragments are not directly included in the authentication; however, the validating timestamp is included, and it indirectly authenticates the remainder.

In all cases, authentication of an RPC message requires less than 2.5 µs.

3.6.3 Performance and scalability comparison with BFT BFT configuration We compare the PASIS implementation of our protocol with the BFT library implementation [Castro and Rodrigues 2003] of the BFT protocol for replicated state machines [Castro and Liskov 1998a], since it is generally regarded as efficient. The semantics provided by BFT are stronger than those provided by the PASIS read/write protocol. Since BFT implements a Byzantine fault-tolerant replicated state machine, arbitrary operations are linearizable, not just block reads and writes (as in PASIS). Additionally, the bounds on the number of storage-nodes required to tolerate an equivalent number of faults are lower than in PASIS (3b + 1 for BFT as compared with 4b + 1 in PASIS). However, BFT incurs the cost of multiple rounds of server communication.

Operations in BFT require agreement among the replicas (storage-nodes in PASIS).

Agreement is performed in four steps: (i) the client broadcasts requests to all replicas; (ii) the primary broadcasts pre-prepare messages to all replicas; (iii) all replicas broadcast ·

3.6 Evaluation 47 prepare messages to all replicas; and, (iv) all replicas send replies back to the client and then broadcast commit messages to all other replicas. Commit messages are piggy-backed on the next pre-prepare or prepare message to reduce the number of messages on the network. Authenticators, lists of MACs, are used to ensure that broadcast messages from clients and replicas cannot be modified by a Byzantine replica. All clients and replicas have public and private keys that enables them to exchange symmetric cryptography keys used to create MACs. Logs of commit messages are checkpointed (garbage collected) periodically. View changes, in which a new primary is selected, are suppressed in all experiments.

An optimistic fast path for read operations (i.e., operations that do not modify state) is implemented in BFT. The client broadcasts its request to all replicas. Each replica replies once all messages previous to the request are committed. Only one replica sends the full reply (i.e., the data and digest), and the remainder just send digests that can verify the correctness of the data returned. If the replies from replicas do not agree, the client reissues the read operation—for the replies to agree, the read-only request must arrive at 2b + 1 of the replicas in the same order (with regard to other write operations). Re-issued read operations perform agreement using the base BFT algorithm.

The BFT configuration does not store data to disk: instead, it stores all data in memory and accesses it via memory offsets (i.e., we implemented a simple block interface using BFT). As such, the storage-component of BFT is much faster than PASIS’ storage component, which provides true disk-based storage. The difference in storage-component latency can be seen in the BFT vs. PASIS breakdown graph shown in Figure 3.10.

BFT uses UDP connections rather than TCP. BFT’s retransmission policy is static (i.e., it does not adapt with the detection of congestion as does TCP’s policy) and can only be set at a coarse granularity (milliseconds). We have observed high rates of retransmission at high load when running write workloads using BFT on our LAN. We believe this causes the dropoff in write throughput as shown in Figure 3.11. Due to the BFT library’s code structure, it would require significant work to change the transport to use TCP instead of UDP in order to achieve a fairer comparison.

· 48 Efficient, scalable consistency for highly fault-tolerant storage PASIS reads (b=t) PASIS writes (b=t) PASIS reads hybrid (b=1) PASIS writes hybrid (b=1) 16 BFT reads w/o mcast (b=t)

–  –  –

Figure 3.9: Mean response time vs.

total failures tolerated. Mean response times of read and write operations of random 16 KB blocks in PASIS and BFT. Lines are shown for PASIS that correspond to both b = t and b = 1 (a hybrid fault model). Multicast was not used for these BFT experiments.

The BFT implementation defaults to using IP multicast. In our environment, like many, IP multicast broadcasts to the entire subnet, thus making it unsuitable for shared environments. We found that the BFT implementation code is fairly fragile when using IP multicast in our environment, making it necessary to disable IP multicast in some cases (where stated explicitly). The BFT implementation authenticates broadcast messages via authenticators, and point-to-point messages with MACs.

Response time

–  –  –

Figure 3.10: Protocol cost breakdown.

The bars illustrate the cost breakdown of read and write operations for PASIS and BFT for b = 1 and b = 4. Each bar corresponds to a single point on the mean response time graph in Figure 3.9. BFT does not store data to disk, as such no server storage cost is shown for BFT.

b = 4. Since measurements are taken at the user-level, kernel-level timings for host network protocol processing (including network system calls) are attributed to the “network” cost of the breakdowns. To understand the response time measurements and scalability of these protocols, it is important to understand these breakdowns.

PASIS has better response times than BFT for write operations due to the spaceefficiency of erasure codes and the nominal amount of work storage-nodes perform to execute write requests. For b = 4, BFT has a blowup of 13× on the network (due to = 3.4× on the network. With IP replication), whereas our protocol has a blowup of 5 multicast the response time of the BFT write operation would improve significantly, since the client would not need to serialize 13 replicas over its link. However, IP multicast does not reduce the aggregate server network utilization of BFT—for b = 4, 13 replicas must be delivered.

PASIS has longer response times than BFT for read operations. This can be attributed to two main factors: First, the PASIS storage-nodes store data in a real file system; since the BFT-based block store keeps all data in memory and accesses blocks via memory offsets, it incurs almost no server storage costs. We expect that a BFT implementation with actual data storage would incur server storage costs similar to PASIS (e.g., around · 50 Efficient, scalable consistency for highly fault-tolerant storage

0.7 ms for a write and 0.4 ms for a read operation, as is shown for PASIS with b = 1 in Figure 3.10). Indeed, the difference in read response time between PASIS and BFT at b = 1 is mostly accounted for by server storage costs. Second, for our protocol, the client computation cost grows as the number of failures tolerated increases because the cost of generating data-fragments grows as N increases.

In addition to the b = t case, Figure 3.9 shows one instance of PASIS assuming a hybrid fault model with b = 1. For space-efficiency, we set m = t + 1. Consequently, QC = 2t + 1 and N = 3t + 2. At t = 1, this configuration is identical to the Byzantineonly configuration. As t increases, this configuration is more space-efficient than the Byzantine-only configuration, since it requires t − 1 fewer storage-nodes. As such, the response times of read and write operations scale better.

Some read operations in PASIS can require repair. A repair operation must perform a “write” operation to repair the value before it is returned by the read. Interestingly, the response time of a read that performs repair is less than the sum of the response times of a normal read and a write operation. This is because the “write” operation during repair does not need to read logical timestamps before issuing write requests. Additionally, data-fragments need only be written to storage-nodes that do not already host the write operation.


–  –  –

Figure 3.11: Throughput vs.

number of clients (b = 1). Throughput of read and write operations of random 16 KB blocks in PASIS and BFT for b = 1. Each client had one request outstanding. For PASIS, lines corresponding to both m = 2, N = 4 and m = 3, N = 5 are shown. For BFT, multicast was used.

accesses. At this point, the disk subsystem would become the bottleneck.

At high load, PASIS has greater write throughput than BFT. BFT’s write throughput peaks at 456 requests per second. But, we observed BFT’s write throughput drops off significantly as client load increased; during these drop-offs, we observed a large increase in request retransmissions. We believe that this is due to the use of UDP and a coarsegrained retransmit policy in BFT’s implementation. The write throughput of PASIS begins to flatten out at 675 requests per second for m = 2 and 806 req/sec for m = 3, significantly outperforming BFT. PASIS provides higher write throughput than BFT, because server links become bottlenecks, even though multicast is used.

Even with multicast enabled, each BFT server link sees a full 16 KB replica, whereas each PASIS server link sees KB. Similarly, due to network space-efficiency, the PASIS m configuration using m = 3 outperforms the minimal PASIS configuration (806 requests per second). Both PASIS and BFT have roughly the same network utilization per read operation (16 KB per operation). To be network-efficient, PASIS uses read witnesses and BFT uses “fast path” read operations. However, PASIS makes use of more storage-nodes than BFT does servers. As such, the aggregate bandwidth available for reads is greater for PASIS than for BFT, and consequently PASIS has a greater read throughput than · 52 Efficient, scalable consistency for highly fault-tolerant storage BFT. Although BFT could add servers to increase its read throughput, doing so would not increase its write throughput (indeed, write throughput would likely drop due to the extra inter-server communication).

BFT vs PASIS scalability summary For PASIS and BFT, scalability is limited by either the server network utilization or server CPU utilization. Figure 3.10 shows that PASIS scales better than BFT in both. Consider write operations. Each BFT server receives an entire replica of the data, whereas each PASIS storage-node receives a data-fragment the size of a replica. The work performed m by BFT servers for each write request grows with b. In PASIS, the server protocol cost decreases from 90 µs for b = 1 to 57 µs for b = 4, whereas in BFT it increases from

0.80 ms to 2.1 ms. The server cost in PASIS decreases because m increases as b increases, reducing the size of the data-fragment that is validated. We believe that the server cost for BFT increases because the number of messages that must be sent to all other servers increases.

3.6.4 Other results Garbage collection We assume a large window of storage version capacity, so garbage collection usually occurs during idle periods. But, even when it competes with real requests, garbage collection is inexpensive. Garbage collection requests are just batched read requests, except that no data need be returned for members that do not tolerate Byzantine clients. When Byzantine clients are tolerated, garbage collection must validate the cross checksum, which does require data-fragments.


–  –  –

client throughput when accessing overlapping block sets. The experiment consists of four clients, each with four operations outstanding. Each client accesses a range of eight data blocks, some overlapping with other clients and some not, and no outstanding requests from the same client going to the same block.

At the highest concurrency level—all eight blocks in contention by all clients—we observed neither significant drops in throughput nor significant increases in mean response time. For example, the asynchronous repairable protocol member classified the initial candidate as complete 88.8% of the time, and found repair was necessary only 3.3% of the time. Since repair occurs rarely, the effect on average response time and throughput is minimal.

Impact of faults Storage-node failures. For clients, storage-node failures have minimal impact on performance.

Client crash failures. Client crash failures appear as partially written data. Subsequent reads may observe these writes as incomplete or unclassifiable. If they are unclassifiable, the read must either abort or attempt repair. Repair adds much of the cost of performing a write, though, the round-trip to obtain a logical timestamp in an asynchronous system is not needed.

3.7 Discussion 3.7.1 Byzantine clients In a storage system, Byzantine clients can write arbitrary values. The use of fine-grained versioning (e.g., self-securing storage [Strunk et al. 2000]) facilitates detection, recovery, and diagnosis from storage intrusions [Strunk et al. 2002]. Once discovered, arbitrarily modified data can be rolled back to its pre-corruption state.

Byzantine clients can also attempt to exhaust the resources available to the PASIS protocol. Issuing an inordinate number of write operations could exhaust storage space.

· 54 Efficient, scalable consistency for highly fault-tolerant storage However, continuous garbage collection frees storage space prior to the latest complete write. If a Byzantine client were to intentionally issue incomplete write operations, then garbage collection may not be able to free up space. In addition, incomplete writes require read operations to roll-back behind them, thus consuming client computation and network resources. In practice, storage-based intrusion detection [Pennington et al. 2003] is probably sufficient to detect such client actions.

3.7.2 Timestamps from Byzantine storage-nodes

Pages:     | 1 |   ...   | 5 | 6 || 8 | 9 |   ...   | 18 |

Similar works:

«Philosophy http://journals.cambridge.org/PHI Additional services for Philosophy: Email alerts: Click here Subscriptions: Click here Commercial reprints: Click here Terms of use : Click here On the Supposed Obligation to Relieve Famine John Kekes Philosophy / Volume null / Issue 04 / April 2002, pp 503 517 DOI: 10.1017/S0031819102000438, Published online: 28 October 2002 Link to this article: http://journals.cambridge.org/abstract_S0031819102000438 How to cite this article: John Kekes (2002). On...»

«Selecting Children: The Ethics of Reproductive Genetic Engineering Forthcoming in Philosophy Compass. © S. MATTHEW LIAO Faculty of Philosophy, Oxford University, Littlegate House, 16/17 St. Ebbes St., Oxford OX1 1PT, UK; e-mail: matthew.liao@philosophy.ox.ac.uk; www.smatthewliao.com 10 July 2008 Selecting Children: The Ethics of Reproductive Genetic Engineering Abstract Advances in reproductive genetic engineering have the potential to transform human lives. Not only do they promise to allow...»

«Ó Springer 2007 Law and Philosophy (2007) 26: 31–65 DOI 10.1007/s10982-005-5917-2 TAMAR MEISELS COMBATANTS – LAWFUL AND UNLAWFUL (Accepted 27 December 2005) The September 11 attacks led many Americans to believe that Al-Qaeda had plunged the U.S. into a new type of war, already familiar to some of the country’s closest allies. Subsequent debates over modern terrorism often involve a sort of lamentation for the passing of old-fashioned wars.1 Paul Gilbert’s New Terror, New Wars suggests...»

«A NOVEL NUMERICAL ANALYSIS OF HALL EFFECT THRUSTER AND ITS APPLICATION IN SIMULTANEOUS DESIGN OF THRUSTER AND OPTIMAL LOW-THRUST TRAJECTORY A Dissertation Presented to The Academic Faculty by Kybeom Kwon In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the School of Aerospace Engineering Georgia Institute of Technology August 2010 COPYRIGHT © 2010 BY KYBEOM KWON A NOVEL NUMERICAL ANALYSIS OF HALL EFFECT THRUSTER AND ITS APPLICATION IN SIMULTANEOUS DESIGN OF...»


«TRUST MANAGEMENT IN DISTRIBUTED RESOURCE CONSTRAINED EMBEDDED SYSTEMS A Dissertation Presented by Peter C. Chapin to The Faculty of the Graduate College of The University of Vermont In Partial Fullfillment of the Requirements for the Degree of Doctor of Philosophy Specializing in Computer Science January, 2014 Accepted by the Faculty of the Graduate College, The University of Vermont, in partial fulllment of the requirements for the degree of Doctor of Philosophy, specializing in Computer...»

«1 (3/20/15) Custom, jus cogens, and human rights JOHN TASIOULAS* [Book chapter for CUSTOM’S FUTURE: INTERNATIONAL LAW IN A CHANGING WORLD (Curtis A. Bradley ed., forthcoming Cambridge University Press)] Immanuel Kant notoriously declared that it was a “scandal of philosophy” that it had not yet furnished us with a convincing proof of the existence of an external world. International lawyers have their equivalent occupational scandal: the failure to achieve clarity or consensus on the...»

«Ethics and Imperialism in Livy by Joseph Viguers Groves A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Classical Studies) in the University of Michigan Doctoral Committee: Professor David Potter Professor Sara Forsdyke Professor Katherine French Associate Professor Celia Schultz Professor Nicola Terrenato © Joseph Groves For my grandparents ii Acknowledgments Finishing this project would never have been possible without the support...»

«de vrijwillige slavernij Étienne de La Boétie Vertoog oVer de Vrijwillige SlaVernij vertaald door Charlotte Bauwens met een inleiding van dr. Jos Verhulst en een nawoord door dr. Frank van Dun Democratie.Nu vzw 2007 | www.democratie.nu Ontwerp cover en lay-out: Michaël Bauwens Dit werk heeft een Creative Commons Attribution-NonCommercialNoDerivs 2.0 Licentie. Dit betekent dat u de teksten vrij mag gebruiken, zolang u de bron en de auteur vermeldt, de tekst intact laat en er geen commercieel...»

«Magneto-hydrodynamics simulation in astrophysics by Bijia Pang A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Physics University of Toronto Copyright c 2011 by Bijia Pang Abstract Magneto-hydrodynamics simulation in astrophysics Bijia Pang Doctor of Philosophy Graduate Department of Physics University of Toronto Magnetohydrodynamics (MHD) studies the dynamics of an electrically conducting fluid under the influence of a...»

«Ontological Progress in Science Richard M. Burian Center for the Study of Science in Society and Department of Philosophy Virginia Polytechnic Institute and State University J.D. Trout Philosophy Department and the Parmly Hearing Institute Loyola University of Chicago Introduction In the grand tradition, philosophical ontology was considered logically and epistemologically prior to scientific ontology. Like many contemporary philosophers of science, we consider this a mistake. There is a...»

«The representation and perception of visual motion: to integrate or not to integrate by James H. Hedges A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy Center for Neural Science New York University September, 2009 Eero P. Simoncelli J. Anthony Movshon UMI Number: 3380197 All rights reserved INFORMATION TO ALL USERS The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the...»

<<  HOME   |    CONTACTS
2017 www.abstract.dislib.info - Abstracts, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.