nikel

Reputation: 3564

How to handle failure of Kafka Cluster

We are going to implement a Kafka Publish Subscribe system.

Now, in the worst case, if all the Kafka brokers for a given topic go down, what happens?

I tried this out: the publisher detects the outage after the default metadata-fetch timeout and throws an exception if the fetch does not succeed.

In this case, we can monitor for the exception and restart the publisher after fixing Kafka.

But what about the consumers? They don't seem to get any exceptions once Kafka goes down. We simply can't ask all the consumers to restart their systems. Is there a better way to solve this problem?

Upvotes: 4

Views: 3161

Answers (2)

TechMaster

Reputation: 261

But, what about the consumers -- they don't seem to get any exceptions once Kafka goes down. We simply can't ask "all" the consumers to restart their systems. Any better way to solve this problem?

Yes, the consumer won't get any exceptions, and this behavior is by design. However, you don't need to restart the consumers; just make sure your logic calls poll() regularly. The consumer is designed so that it is not affected even when no broker in the cluster is alive. Consider the following steps to understand what actually happens:

1: All brokers are down; there is no active cluster.

2: consumer.poll(timeout) // This is called from your own code

3: Inside the poll() call in KafkaConsumer.java, the following sequence of calls takes place:

poll() --> pollOnce() --> ensureCoordinatorKnown() --> awaitMetadataUpdate()

These are the main method calls made after the internal logical checks. At this point, your consumer waits until the cluster is up again.

4: The cluster comes back up or is restarted.

5: The consumer is notified and resumes working as it did before the cluster went down.

Note: The consumer resumes receiving messages from the last committed offset; messages that were received and committed successfully won't be delivered again.

The described behavior applies to version 0.9.x.
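Since the consumer stays silent during an outage, one option is to have the poll loop itself report when nothing has arrived for a while. The following is only a minimal sketch of that idea, not Kafka client code: RecordSource is a hypothetical stand-in for the consumer's poll() call, and PollLoopSketch, run, and staleAfterMs are made-up names. It also assumes that poll() returns an empty batch once its timeout elapses; if poll() blocks as described in step 3, the same check would have to run on a separate thread. An empty batch can also just mean an idle topic, so the message is a hint, not proof of an outage.

```java
/**
 * Hypothetical stand-in for the one consumer call this sketch needs.
 * It is NOT part of the Kafka client library.
 */
interface RecordSource {
    /** Returns the records fetched within the timeout; empty if none arrived. */
    java.util.List<String> poll(long timeoutMs);
}

class PollLoopSketch {
    /**
     * Polls maxPolls times and returns how often the loop entered the
     * "stale" state, i.e. no records for longer than staleAfterMs.
     * The clock is injected so the logic can be exercised without waiting.
     */
    static int run(RecordSource source,
                   java.util.function.LongSupplier clock,
                   long staleAfterMs,
                   int maxPolls) {
        long lastRecordAt = clock.getAsLong();
        boolean stale = false;
        int alerts = 0;
        for (int i = 0; i < maxPolls; i++) {
            java.util.List<String> records = source.poll(100);
            long now = clock.getAsLong();
            if (!records.isEmpty()) {
                // Records arrived: reset the staleness state.
                lastRecordAt = now;
                stale = false;
                // ... process the records here ...
            } else if (!stale && now - lastRecordAt > staleAfterMs) {
                // Report once per quiet period, then keep polling.
                stale = true;
                alerts++;
                System.err.println("No records for " + (now - lastRecordAt)
                        + " ms; brokers may be unreachable");
            }
        }
        return alerts;
    }
}
```

With a real consumer, the lambda passed as RecordSource would wrap consumer.poll(...) and the clock would be System::currentTimeMillis. The loop never exits on staleness; it only reports it, which matches the advice to keep polling.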

Upvotes: 4

Nautilus

Reputation: 2286

If the consumer (version 0.9.x) is polling when the cluster goes down, it should get the following exception:

java.net.ConnectException: Connection refused

You can keep polling until the cluster is back. There is no need to restart the consumer; it will re-establish the connection.
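"Keep polling" can be sketched as a retry loop with a capped backoff, so a down cluster is not hammered with attempts. This is an illustration only: RetryingPoller, Poll, fetch, and pollUntilUp are hypothetical names, not Kafka APIs, and the sketch assumes the failure actually reaches your code as an exception. Depending on the client version, it may only appear in the logs.

```java
class RetryingPoller {
    /** Hypothetical stand-in for a poll call that may fail while brokers are down. */
    interface Poll {
        int fetch() throws Exception; // returns the number of records fetched
    }

    /** Next delay: double the current one, capped at maxMs. */
    static long nextDelay(long currentMs, long maxMs) {
        return Math.min(currentMs * 2, maxMs);
    }

    /**
     * Calls poll until it succeeds or maxAttempts is used up.
     * Returns the number of attempts needed, or -1 if every attempt
     * failed or the thread was interrupted while waiting.
     */
    static int pollUntilUp(Poll poll, long initialDelayMs, long maxDelayMs, int maxAttempts) {
        long delay = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                poll.fetch();
                return attempt;
            } catch (Exception e) {
                System.err.println("poll failed (" + e + "); retrying in " + delay + " ms");
            }
            try {
                Thread.sleep(delay);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return -1;
            }
            delay = nextDelay(delay, maxDelayMs);
        }
        return -1;
    }
}
```

The consumer object is reused across attempts; nothing is recreated or restarted, only the wait between attempts grows.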

Upvotes: 2
