user50874

Reputation: 331

How does Raft guarantee consistency when a network partition occurs?

Suppose a network partition occurs and the leader A is in the minority. Raft will elect a new leader B, but A still thinks it's the leader for some time. Now we have two clients: Client 1 writes a key/value pair to B, then Client 2 reads the key from A before A steps down. Because A still believes it's the leader, it will return stale data.

The original paper says:

Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests.

Isn't that too expensive? The leader has to talk to a majority of nodes for every read request?
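To make the cost concrete, the check the paper describes amounts to something like this on every read. This is only a rough Go sketch with hypothetical types (Peer, Leader), not a real implementation:

    // Rough sketch of the quorum check the paper describes for read-only
    // requests. Peer and Leader are hypothetical types, not a real Raft API.
    package readsketch

    import "errors"

    type Peer interface {
        // Heartbeat returns true if the peer still acknowledges us as leader
        // for the given term.
        Heartbeat(term uint64) bool
    }

    type Leader struct {
        term  uint64
        peers []Peer // the other nodes; cluster size is len(peers)+1
        store map[string]string
    }

    // LinearizableRead confirms leadership with a majority before answering.
    // This heartbeat exchange is the extra round trip the question asks about.
    func (l *Leader) LinearizableRead(key string) (string, error) {
        acks := 1 // count ourselves
        for _, p := range l.peers {
            if p.Heartbeat(l.term) {
                acks++
            }
        }
        if acks <= (len(l.peers)+1)/2 { // no majority: we may have been deposed
            return "", errors.New("deposed: lost contact with a majority")
        }
        return l.store[key], nil
    }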

Upvotes: 19

Views: 3653

Answers (4)

GManNickG

Reputation: 503805

I'm surprised there's so much ambiguity in the answers, as this is quite well known:

Yes, to get linearizable reads from Raft you must round-trip through the quorum.

There are no shortcuts here. In fact, both etcd and Consul committed an error in their implementations of Raft and caused linearizability violations. The implementors erroneously believed (as did many people, including myself) that if a node thought of itself as a leader, it was the leader.

Raft does not guarantee this at all. A node can be a leader and not learn of its loss of leadership because of the very network partition that caused someone else to step up in the first place. Because clock error is taken as unbounded in distributed systems literature, no amount of waiting can solve this race condition. New leaders cannot simply "wait it out" and then decide "okay, the old leader must have realized it by now". This is just typical lease lock stuff - you can't use clocks with unbounded error to make distributed decisions.

Jepsen covered this error in detail, but to quote the conclusion:

[There are] three types of reads, for varying performance/correctness needs:

  • Anything-goes reads, where any node can respond with its last known value. Totally available, in the CAP sense, but no guarantees of monotonicity. Etcd does this by default, and Consul terms this “stale”.
  • Mostly-consistent reads, where only leaders can respond, and stale reads are occasionally allowed. This is what etcd currently terms “consistent”, and what Consul does by default.
  • Consistent reads, which require a round-trip delay so the leader can confirm it is still authoritative before responding. Consul now terms this consistent.
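To make the distinction concrete, here's a minimal sketch of the three modes as a client-visible option; the types and helper here are hypothetical, not etcd's or Consul's actual API:

    // Sketch only: the three read modes as a client-visible option.
    package readmodes

    import "errors"

    type ReadMode int

    const (
        ReadAny          ReadMode = iota // any replica answers from local state; may be stale
        ReadLeaderLocal                  // only the leader answers, without re-checking leadership
        ReadLinearizable                 // the leader confirms with a quorum before answering
    )

    var ErrNotLeader = errors.New("not the leader")

    type Node struct {
        isLeader          bool
        data              map[string]string
        confirmLeadership func() error // the quorum round trip discussed above
    }

    func (n *Node) Read(key string, mode ReadMode) (string, error) {
        if mode != ReadAny && !n.isLeader {
            return "", ErrNotLeader
        }
        if mode == ReadLinearizable {
            if err := n.confirmLeadership(); err != nil {
                return "", err // deposed; the client should retry against the new leader
            }
        }
        return n.data[key], nil
    }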

Just to tie in with some other results from literature, this very problem was one of the things Flexible Paxos showed it could handle. The key realization in FPaxos is that you have two quorums: one for leader election and one for replication. The only requirement is that these quorums intersect, and while a majority quorum is guaranteed to do so, it is not the only configuration.

For example, one could require that every node participate in leader election. The winner of this election could be the sole node serving requests - now it is safe for this node to serve reads locally, because it knows for a new leader to step up the leadership quorum would need to include itself. (Of course, the tradeoff is that if this node went down, you could not elect a new leader!)

The point of FPaxos is that this is an engineering tradeoff you get to make.
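A tiny counting-quorum check makes the intersection requirement concrete (illustrative only):

    // Sketch: FPaxos's safety requirement is that every possible
    // leader-election quorum intersects every possible replication quorum.
    // With simple counting quorums over n nodes, that reduces to a size check.
    package fpaxossketch

    // quorumsIntersect reports whether an election quorum of size electionQ and
    // a replication quorum of size replicationQ must share at least one of n nodes.
    func quorumsIntersect(n, electionQ, replicationQ int) bool {
        return electionQ+replicationQ > n
    }

    // With n = 5: majorities work (3 + 3 > 5), and so does the example above,
    // where election requires all 5 nodes and a single node replicates
    // (5 + 1 > 5), letting that node serve reads locally.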

Upvotes: 9

WebServer

Reputation: 1396

Question

The leader has to talk to a majority of nodes for every read request

Answer: No.


Explanation

Let's understand it with a code example from HashiCorp's Raft implementation.

There are two timeouts involved (their names are self-explanatory, but links to the detailed definitions are included below):

  • LeaderLease timeout[1]
  • Election timeout[2]

Example values are 500 ms and 1000 ms respectively. [3]

A required condition for a node to start is: LeaderLease timeout < Election timeout [4,5]
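For instance, with hashicorp/raft the relationship is checked when the node is configured, roughly like this (a sketch only; exact field names, defaults, and validation rules may differ between versions of the library):

    // Sketch of setting these timeouts with hashicorp/raft; details may vary
    // by library version.
    package main

    import (
        "fmt"
        "time"

        "github.com/hashicorp/raft"
    )

    func main() {
        cfg := raft.DefaultConfig()
        cfg.LocalID = "node-a"
        cfg.LeaderLeaseTimeout = 500 * time.Millisecond // how long leadership is trusted without quorum contact
        cfg.HeartbeatTimeout = 1000 * time.Millisecond
        cfg.ElectionTimeout = 1000 * time.Millisecond // followers wait this long before starting an election

        // The library rejects configurations where the lease could outlive the
        // heartbeat/election timeouts, which is what forces the old leader to
        // step down before a new one can be elected.
        if err := raft.ValidateConfig(cfg); err != nil {
            fmt.Println("invalid config:", err)
        }
    }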

Once a node becomes leader, it continuously checks whether it is still heartbeating with a quorum of followers [6, 7]. If the heartbeats stop, this is tolerated only until the LeaderLease timeout expires [8]. If the leader cannot contact a quorum of nodes within the LeaderLease timeout, it must step down and become a follower [9].

Hence, for the example given in the question, node A must step down as leader before node B becomes leader. Since node A knows it is no longer the leader before node B is elected, node A will not serve the read or write request.
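Schematically, the lease check behaves like this simplified sketch (not the actual hashicorp/raft code, just the shape of the logic described above):

    // Simplified sketch of a leader-lease check; illustrative only.
    package leasesketch

    import "time"

    type leader struct {
        leaderLeaseTimeout time.Duration
        lastQuorumContact  time.Time // updated whenever a quorum of followers acks a heartbeat
        stepDown           func()    // transition back to follower
    }

    // checkLease runs periodically while the node believes it is the leader.
    func (l *leader) checkLease() {
        if time.Since(l.lastQuorumContact) > l.leaderLeaseTimeout {
            // No contact with a majority within the lease window: a new leader
            // may already exist, so stop serving reads and writes.
            l.stepDown()
        }
    }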

[1] https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L141
[2] https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L179
[3] https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L230
[4] https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L272
[5] https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L275
[6] https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L456
[7] https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L762
[8] https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L891
[9] https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L894

Upvotes: -1

edwardpku

Reputation: 71

Not sure whether timeout configuration can solve this problem:

2 x heartbeat interval <= election timeout

which means that when a network partition happens, leader A is briefly the only leader on its side, but writes will fail because A is in the minority and cannot get acknowledgements back from a majority of the nodes, so it steps back to being a follower.

After that, leader B is elected; it can catch up on the latest changes from at least one of the followers, and then clients can perform reads and writes on leader B.

Upvotes: 0

Michael Deardeuff

Reputation: 10697

The leader doesn't have to talk to a majority for each read request. Instead, as it continuously heartbeats with its peers, it maintains a staleness measure: how long has it been since it received an okay from a quorum? On a read, the leader checks this staleness measure and, if it exceeds some threshold, returns a StalenessExceeded error. This gives the calling system the chance to connect to another host.
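In code, that check might look roughly like this (illustrative sketch; StalenessExceeded is just the name used above, not a standard Raft error):

    // Sketch of a leader-side staleness check; types and names are illustrative.
    package stalesketch

    import (
        "errors"
        "time"
    )

    var ErrStalenessExceeded = errors.New("staleness exceeded")

    type leader struct {
        maxStaleness   time.Duration
        lastQuorumOkay time.Time // refreshed as heartbeats are acknowledged by a quorum
        data           map[string]string
    }

    func (l *leader) Read(key string) (string, error) {
        if time.Since(l.lastQuorumOkay) > l.maxStaleness {
            // Too long since a majority confirmed our leadership; let the
            // caller fail over to another host.
            return "", ErrStalenessExceeded
        }
        return l.data[key], nil
    }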

It may be better to push that staleness check to the calling systems; let the low-level raft system have higher availability (in CAP terms) and let the calling systems decide at what staleness level to fail over. This can be done in various ways. You could have the calling systems heartbeat to the raft system, but my favorite is to return the staleness measure in the response. This last approach can be improved further: the client includes its timestamp in the request, the raft server echoes it back in the response, and the client adds the round-trip time to the raft staleness. (NB: always use the nano (monotonic) clock when measuring time differences, because it doesn't go backwards the way the system clock can.)
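A sketch of that echoed-timestamp variant (the request/response shapes here are hypothetical):

    // Sketch of the client-side staleness calculation described above.
    package clientsketch

    import "time"

    type ReadResponse struct {
        Value           string
        EchoedSentAt    time.Time     // the client's request timestamp, echoed back unchanged
        ServerStaleness time.Duration // how long since the server last heard from a quorum
    }

    // staleness bounds how old the value might be from the client's point of
    // view: the server-reported staleness plus the full round trip. sentAt
    // should be the client's own retained time.Now() from when it sent the
    // request, so time.Since uses Go's monotonic clock and never goes backwards.
    func staleness(sentAt time.Time, resp ReadResponse) time.Duration {
        return resp.ServerStaleness + time.Since(sentAt)
    }

    func acceptable(sentAt time.Time, resp ReadResponse, max time.Duration) bool {
        return staleness(sentAt, resp) <= max
    }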

Upvotes: 1
