user1128016
user1128016

Reputation: 1538

Whats the difference between Paxos and W+R>=N in Cassandra?

Dynamo-like databases (e.g. Cassandra) can enforce consistency by means of quorum, i.e. a number of synchronously written replicas (W) and a number of replicas to read (R) should be chosen in such a way that W+R>N where N is a replication factor. On the other hand, PAXOS-based systems like Zookeeper are also used as a consistent fault-tolerant storage.

What is the difference between these two approaches? Does PAXOS provide guarantees that are not provided by W+R>N schema?

Upvotes: 47

Views: 12214

Answers (6)

Sriram Srinivasan
Sriram Srinivasan

Reputation: 1325

W+R > N ensures that a read would encounter at least one write from the last successful write. This is a necessary but not sufficient foundation.

Paxos and Cassandra and others are built on top of this foundation; specifically a majority read/write quorum (a specific instance of W+R > N where both W and R are more than N/2)

You can think of Basic Paxos as a key value store with a single key (what the literature refers to as a "register". We can use the register's name as a notional key). Basic Paxos gives you a wait-free, write once, key-value store with a single key. Once that key has been written to (with the value chosen from among competing proposals), the value is frozen. We have achieved consensus. Abstractly, you can have a mutable store by using a <register name, version number> as a key, so you get an infinity of registers. This is essentially multipaxos, where each cell of the log is a basic paxos register.

Cassandra is a mutable KV store that overrides previous values.

That is the first difference. (ignore the fact that Cassandra can take many of key/value pairs; we can treat them both as kv stores abstractly for now regardless of cardinality)


Next, it is not sufficient to say "W+R>N" or "quorum majority". Consider:

  1. Client C1 writes value A to servers 1 and 2 (server 3 is unavailable)
  2. Client C2 writes value B to servers 2 and 3
  3. Now server 2 dies, but 1 and 3 are up.
  4. Client C3 reads A from server 1, and B from server 3.

How does C3 resolve the tie? It needs some meta data to say which one is a later write.

Cassandra resolves this by attaching a timestamp and says "last writer wins". But no two clocks are synchronized perfectly for ever. It is very possible that C1's write of A had a timestamp (say) of 100, and C2's write, although happening later, has a timestamp of 10, because C2's clock is running slow. C3 will thus infer "A" as the later write, because the one with the higher timestamp wins. This is wrong. Cassandra will lose updates in such a scenario.

To get a linearizable store -- whether write-once Paxos style, or a linearizable mutable key value store (S3, for example), it is necessary to ensure that the metadata associated with the value be monotonically increasing, so that a later write has a later value. Paxos and others ensure this by using increasing version numbers (called ballots in the paxos paper, term in the Raft paper). Before a write, a server reads the highest version available on other servers, and uses the next highest version. The max version number is always increasing. On read, a server reads a majority and uses the fact that one or more of them will have the highest version number (the last write). It will then write that version back to those that are lagging before returning the result. In other words, a write always involves a read first, and a read sometimes involves writing back to laggards (if any)

Upvotes: 0

oldtimer
oldtimer

Reputation: 59

As mentioned in other answers, in an R+W > N system, the writes are not atomic on all nodes which means that when a write is in progress (or during a write failure) some nodes will have newer values and some older ones. Take an example of a system where n=3, r=2, and w=2. For clarity let's assume the 3 nodes are named A, B, and C. Consider this scenario: a write is in progress; node A has been updated while B and C are still in process of receiving the updated value. Clients reading from A and B will see the newer value (resolved using version vectors or last write wins) and clients reading from B and C will see old values. This type of read is not considered linearizable. Such issues will not occur with proper linearizable systems such as Paxos or Raft.

Upvotes: 5

mcdowella
mcdowella

Reputation: 19611

Paxos is non-trivial to implement, and expensive enough that many systems using it use hints as well, or use it only for leader election, or something. However, it does provide guaranteed consistency in the presence of failures - subject of course to the limits of its particular failure model.

The first quorum based systems I saw assumed some sort of leader or transaction infrastructure that would ensure enough consistency that you could trust that the quorum mechanism worked. This infrastructure might well be Paxos-based.

Looking at descriptions such as https://cloudant.com/blog/dynamo-and-couchdb-clusters/, it would appear that Dynamo is not based on an infrastructure that guarantees consistency for its quorum system - so is it being very clever or cutting corners? According to http://muratbuffalo.blogspot.co.uk/2010/11/dynamo-amazons-highly-available-key.html, "The Dynamo system emphasizes availability to the extent of sacrificing consistency. The abstract reads "Dynamo sacrifices consistency under certain failure scenarios". Actually, later it becomes clear that Dynamo sacrifices consistency even in the absence of failures: Dynamo may become inconsistent in the presence of multiple concurrent write requests since the replicas may diverge due to multiple coordinators." (end quote)

So, it would appear that in the case of quorums as implemented in Dynamo, Paxos provides stronger reliability guarantees.

Upvotes: 18

Ezra Hoch
Ezra Hoch

Reputation: 1752

Paxos and the W+R>N quorum try to solve slightly different problems. Paxos is usually described as a way to replicate a state machine, but in fact it is more of a distributed log: each item written to the log gets an index, and the different servers eventually hold the same log items + their index. (Replicated state machine can be achieved by writing to the log the inputs to the state machine and each server replays the state machine on the agreed inputs according to their index). You can read more about Paxos in a blog post I wrote here.

The W+R>N quorum solves the problem of sharing a single value among multiple servers. In the academia it is called "shared register". A shared register has two operations: read and write, where we expect the read to return the value of the previous write.

So, Paxos and the W+R>N quorum live in different domains, and have different properties (e.g., Paxos saves an ordered list of items). However, Paxos can be used to implement a shared register, and a W+R>N quorum can be used to implement a distributed log (although, very inefficiently).

Saying all the above, sometimes the W+R>N quorums aren't implemented in their "fully robust" way, as it will require more than one communication round. Thus, in systems that want low latency, it is possible that their implementation of W+R>N quorums provide weaker properties (e.g., conflicting values can co exist).

To sum up, theoretically, Paxos and the W+R>N can achieve the same goals. Practically, it would be very inefficient, and each one is better for something slightly different. Even more practically, W+R>N isn't always implemented fully, thus scarifying some consistency properties for speed.

Update: Paxos supports a very general failure model: messages can be dropped, nodes can crash and restart. The W+R>N quorum scheme has dfferent implementations, many of which assume less general failures. So, the difference between the two also depends on the assumption on the possible failures that are supported.

Upvotes: 15

Chen Fu
Chen Fu

Reputation: 27

There is no difference. The definition of a quorum says that any two quorums' intersection is not empty. Simple majority quorum is an example NOT a definition. Take a look at Dr. Lamport's later paper "Vertical Paxos", where he gave some other possible configuration of quorums.

Multi-decree paxos protocol (AKA Multi-Paxos), in steady state it's just two phase commit. Ballot number changes are only needed when the leader fails.

Zookeeper's replication protocol (ZAB) , and RAFT are all based on Paxos. The differences are in fault-detection and transition after a leader fails.

Upvotes: 1

Hampus
Hampus

Reputation: 637

Yes, Paxos provides guarantees that are not provided by the Dynamo-like systems and their read-write quorums. The difference is how failures are handled and what happens during a write. After a successful write, both kind of systems behave similarly. The data will be saved and available for reading afterwards (until overwritten or deleted) and so on.

The difference appears during a write and after failures. Until you get a successful answer from W nodes when writing something to the eventually consistent systems, then the data may have been written to some nodes and not to others and there is no guarantee that the whole system agrees on the current value. If you try to read the data back at this point, some clients may get the new data back and some the old data back. In other words, the system is not immediately consistent. This is because writes aren't atomic across nodes in these systems. There are usually mechanisms to "heal" an inconsistency like this and "eventually" the system will become consistent again (i.e. reads will once again always return the same value, until something new is written). This is the reason why they are often called "eventually consistent". Inconsistencies can (and will) appear, but they will always be dealt with and reconciled eventually.

With Paxos, writes can be made atomic across nodes and inconsistencies between nodes are therefore possible to avoid. The Paxos algorithm makes it possible to guarantee that non-faulty nodes never disagree on the outcome of a write, at any point in time. Either the write succeeded everywhere or nowhere. There will never be any inconsistent reads at any point (if it's correctly implemented and if all the assumptions hold, of course). This comes at a cost, however. Mainly, the system may need to delay some requests and be unavailable when for example too many nodes (or the communication between them) aren't working. This is necessary to assure that no inconsistent replies are given.

To summarize: the main difference is that the Dynamo-like systems can return inconsistent results during writes or after failures for some time (but will eventually recover from it), whereas Paxos based systems can guarantee that there are never any such inconsistencies by sometimes being unavailable and delaying requests instead.

Upvotes: 42

Related Questions