Reputation: 14459
I've read Lamport's paper on Paxos. I've also heard that it isn't used much in practice, for reasons of performance. What algorithms are commonly used for consensus in distributed systems?
Upvotes: 19
Views: 4841
Reputation: 10667
Raft is more understandable, and faster alternative of Paxos. One of the most popular distributed systems which uses Raft is Etcd. Etcd is the distributed store used in Kubernetes.
It's equivalent to Paxos in fault-tolerance.
Upvotes: 0
Reputation: 970
There are two general blockchain consensus systems:
The first generation blockchain consensus algorithms (Proof of Work, Proof of Stake, and BitShares’ Delegated Proof of Stake) only offer high probability of finality that grows with time. In theory someone could pay enough money to mine an alternative “longer” Bitcoin blockchain that goes all the way back to genesis.
More recent consensus algorithms, whether HashGraph, Casper, Tendermint, or DPOS BFT
all adopt long-established principles of Paxos
and related consensus algorithms. Under these models it is possible to reach unambiguous finality under all network conditions so long as more than ⅔ of participants are honest.
Objective and unambiguous 100% finality is a critical property for all blockchains that wish to support inter-blockchain communication. Absent 100% finality, a reversion on one chain could have irreconcilable ripple effects across all interconnected chains.
The abstract protocol for these more recent protocols
involves:
It is the technical differences in the protocols that give rise to real-world impact on user experience. This includes things such as latency until finality, degrees of finality, bandwidth, and proof generation / validation overhead.
Look for more details on delegated proof of stake by eos here
Upvotes: 0
Reputation: 1896
Paxos is optimal in terms of performance of consensus protocols, at least in terms of the number of network delays (which is often the dominating factor). It's clearly not possible to reliably achieve consensus while tolerating up to f failures without a single round-trip communication to at least (f-1) other nodes in between a client request and the corresponding confirmation, and Paxos achieves this lower bound. This gives a hard bound on the latency of each request to a consensus-based protocol regardless of implementation. In particular, Raft, Zab, Viewstamped Replication and all other variants on consensus protocols all have the same performance constraint.
One thing that can be improved from standard Paxos (also Raft, Zab, ...) is that there is a distinguished leader which ends up doing more than its fair share of the work and may therefore end up being a bit of a bottleneck. There is a protocol known as Egalitarian Paxos which spreads the load out across multiple leaders, although it's mindbendingly complicated IMO, is only applicable to certain domains, and still must obey the lower bound on the number of round-trips within each request. See the paper "There Is More Consensus in Egalitarian Parliaments" by Moraru et al for more details.
When you hear that Paxos is rarely used due to its poor performance, it is frequently meant that consensus itself is rarely used due to poor performance, and this is a fair criticism: it is possible to achieve much higher performance if you can avoid the need for consensus-based coordination between nodes as much as possible, because this allows for horizontal scalability.
Snarkily, it's also possible to achieve better performance by claiming to be using a proper consensus protocol but actually doing something that fails in some cases. Aphyr's blog is littered with examples of these failures not being as rare as you might like, where database implementations have either introduced bugs into good consensus-like protocols by way of "optimisation", or else developed custom consensus-like protocols that fail to be fully correct in some subtle fashion. This stuff is hard.
Upvotes: 2
Reputation: 6822
With Multi-Paxos when the leader is galloping it can respond to the client write when it has heard that the majority of nodes have written the value to disk. This is as good and efficient as you can get to maintain the consistency guarantees that Paxos makes.
Typically though people use something paxos-like such as zookeeper as an external service (dedicated cluster) to keep critical information consistent (who has locked what, who is leader, who is in a cluster, what's the configuration of the cluster) then run a less strict algorithm with less consistency guarantees which relies upon application specifics (eg vector clocks and merged siblings). The short ebook distributed systems for fun and profit as a good overview of the alternatives.
Note that lots of databases compete on speed by using risky defaults which risk consistency and can loose data under network partitions. The Aphry blog series on Jepson shows whether well know opensouce systems loose data. One cannot cheat the CAP Theorem; if you configure systems for safety then they end up doing about the same messaging and same disk writes as paxos. So really you cannot say Paxos is slow you have to say "a part of a system which needs consistency under network partitions requires a minimum number of messages and disk flushes per operation and that is slow".
Upvotes: 0
Reputation: 1907
Check out the Raft algorithm for a consensus algorithm that is optimized for ease of understanding and clarity of implementation. Oh... it is pretty fast as well.
https://ramcloud.stanford.edu/wiki/display/logcabin/LogCabin
https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf
Upvotes: 3
Reputation: 10697
The Paxos system I run (which supports really, really big web sites) is halfway in-between Basic-Paxos Multi-paxos. I plan on moving it to a full Multi-Paxos implementation.
Paxos isn't that great as a high-throughput data storage system, but it excels in supporting those systems by providing leader election. For example, say you have a replicated data store where you want a single master for performance reasons. Your data store nodes will use the Paxos system to choose the master.
Like Google Chubby, my system is run as a service and can also store data as configuration container. (I use configuration loosely; I hear Google uses Chubby for DNS.) This data doesn't change as often as user input so it doesn't need high throughput write SLAs. Reading, on the other hand, is extremely quick because it is fully replicated and you can read from any node.
Update
Since writing this, I have upgraded my Paxos system. I am now using a chain-consensus protocol as the primary consensus system. The chain system still utilizes Basic-Paxos for re-configuration—including notifying chain nodes when the chain membership changes.
Upvotes: 2
Reputation: 46
Google documented how they did fast paxos for their megastore in the following paper: Link.
Upvotes: 1
Reputation: 11
You should check the Apache Zookeeper project. It is used in production by Yahoo! and Facebook among others.
http://hadoop.apache.org/zookeeper/
If you look for academic papers describing it, it is described in a paper at usenix ATC'10. The consensus protocol (a variant of Paxos) is described in a paper at DSN'11.
Upvotes: 1
Reputation: 19601
If performance is an issue, consider whether you need all of the strong consistency guarantees Paxos gives you. See e.g. http://queue.acm.org/detail.cfm?id=1466448 and http://incubator.apache.org/cassandra/. Searching on Paxos optimised gets me hits, but I suspect that relaxing some of the requirements will buy you more than tuning the protocol.
Upvotes: 2
Reputation: 26868
Not sure if this is helpful (since this is not from actual production information), but in our "distributed systems" course we've studied, along with Paxos, the Chandra-Toueg and Mostefaoui-Raynal algorithms (of the latter our professor was especially fond).
Upvotes: 4