Reputation: 21

In distributed systems, why do we need 2n+1 nodes to handle n failures?

I recently read that in order to handle the failure of n-nodes, the data has to be replicated on 2n+1 nodes. I could not understand the reasoning behind that. Could someone please explain?

Upvotes: 2

Answers (2)

jop

Reputation: 2316

This is the valid quorum configuration that requires the least number n of processes to tolerate f faults.

In detail, for fault tolerance, you can never wait for reading or writing to all processes, otherwise you'll block when at least one of them crashes. You need to read and write from sub-sets.

Given that you're not writing and reading all of them, you have to be sure that (1) you read from at least one process that has the latest version of data and that (2) every two writes intersect, such that one of them aborts. These are the quorum rules.

Finally, having n = 2f+1 processes and writing to f+1 is the configuration where you need the least n for f. You might still obey the quorum with a larger write quorum with a smaller read quorum, but then you need more processes to ensure that writes never block waiting for failed processes.

Upvotes: 4

coding_buddha

Reputation: 61

Ok, so think about it like this. A polynomial of degree n is defined uniquely by n+1 points. The proof for this is rather long and requires some knowledge of linear algebra so I will just link it here. Thus, if you want to send a message, you can derive the polynomial that encodes the message ( optimally through some mutually agreed standard so the person who receives the message will know what to do ). But how many points do you send through your channel? If you know the channel will drop n packets and the person receiving requires n+1 packets to read the message, you will need to interpolate your polynomial using the n+1 points you want to send and then calculate n additional points that lie on that polynomial and send the whole set of 2n+1 points so that the person receiving will always be able to reconstruct your polynomial and read the message.

Upvotes: 1

In distributed systems, why do we need 2n+1 nodes to handle n failures?

Answers (2)

Related Questions