Can Kafka tolerate N-1 failures?

Question

I'm reading the documentation for Kafka and it says here:

For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any messages committed to the log.

http://kafka.apache.org/documentation.html#introduction (It's right above 1.2 Use Cases)

How is this possible? From my understanding, topics use ZooKeeper under the hood, which uses Zab (a Paxos-like algorithm). I couldn't find any documentation about Zab aside from this page:

https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab+vs.+Paxos

Can someone explain to me how they can support N-1 failures? Isn't N-1 literally just everything down except a single machine?

Also, if anyone knows any good places to read up on Zab, or any videos about it, please let me know.

Aside from this one, http://web.stanford.edu/class/cs347/reading/zab.pdf, because I was hoping for something easier.

Thanks

Morgan Kenyon · Accepted Answer

I can help you answer the Kafka/Zookeeper part of your question. I think you're confusing how Kafka and Zookeeper work together.

I think it's probably better to think of Kafka and Zookeeper as operating independently, with both needing to work for the system to do its job. Either one can fail on its own:

The Zookeeper ensemble could fail, causing Kafka to stop working, but only because Zookeeper has failed, not because there's something wrong with the Kafka cluster.
The Kafka cluster could fail while Zookeeper keeps working. But since Kafka is down, the system as a whole doesn't work.

Kafka and Zookeeper have different rules for what constitutes failure.

A Zookeeper ensemble will continue to work as long as a majority of the Zookeeper servers are running. So if you have 7 Zookeeper servers, the ensemble can tolerate up to 3 failures; a fourth failure leaves no majority and the ensemble stops working. [reference]
Kafka has a different rule. For a topic with replication factor N, Kafka will continue to serve that topic as long as one of the N Kafka machines holding its replicas stays alive. That is the N-1 figure you quoted.
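
To make the difference between the two rules concrete, here is a rough sketch in Python. It only does the arithmetic described above; the function names are made up for illustration and are not part of any Kafka or Zookeeper API. It also assumes that every live replica holds all committed messages, which is a simplification.

```python
def majority_quorum_tolerance(ensemble_size: int) -> int:
    """Failures a majority-quorum ensemble (such as Zookeeper) can survive."""
    # A majority needs floor(n / 2) + 1 live servers; the rest may fail.
    return ensemble_size - (ensemble_size // 2 + 1)


def replica_tolerance(replication_factor: int) -> int:
    """Failures allowed by the quoted Kafka rule for one topic: N - 1."""
    return replication_factor - 1


def system_available(zk_alive: int, zk_total: int, replicas_alive: int) -> bool:
    """Hypothetical combined check: both halves must be working."""
    has_quorum = zk_alive > zk_total // 2   # strict majority of Zookeeper servers
    has_replica = replicas_alive >= 1       # at least one Kafka replica left
    return has_quorum and has_replica


if __name__ == "__main__":
    for n in (3, 5, 7):
        print(
            f"n={n}: majority quorum tolerates {majority_quorum_tolerance(n)}, "
            f"replication tolerates {replica_tolerance(n)}"
        )
```

With 7 Zookeeper servers the first function gives 3, matching the example above, while a topic with replication factor 7 could lose 6 of its replicas. The last function shows why both numbers matter: the system as a whole is only up when Zookeeper still has a majority and at least one replica is left.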

I don't know anything about Zab, the Paxos-like algorithm you mention that Zookeeper uses, but to my understanding this is how Kafka and Zookeeper work together.
