Sumit Sinha

Reputation: 666

Why isn't Kafka continuing to work after one of the brokers fails?

I am under the impression that with two brokers and sync turned on, my Kafka setup should keep working even if one of the brokers fails.

To test it I made a new topic named topicname. Its description is as follows:

Topic:topicname    PartitionCount:1 ReplicationFactor:1 Configs:
Topic: topicname    Partition: 0    Leader: 0   Replicas: 0 Isr: 0

Then I ran producer.sh and consumer.sh in the following way:

bin/kafka-console-producer.sh --broker-list localhost:9092,localhost:9095 sync --topic topicname

bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic topicname --from-beginning

While both brokers were running I saw that messages were being received properly by the consumer, but when I killed one of the broker instances with the kill command, the consumer stopped showing me any new messages. Instead it showed the following error message:

WARN [ConsumerFetcherThread-console-consumer-57116_ip-<internalipvalue>-1438604886831-603de65b-0-0], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 865; ClientId: console-consumer-57116; ReplicaId: -1; MaxWait: 100 ms; MinBytes: 1 bytes; RequestInfo: [topicname,0] -> PartitionFetchInfo(9,1048576). Possible cause: java.nio.channels.ClosedChannelException (kafka.consumer.ConsumerFetcherThread)
[2015-08-03 12:29:36,341] WARN Fetching topic metadata with correlation id 1 for topics [Set(topicname)] from broker [id:0,host:<hostname>,port:9092] failed (kafka.client.ClientUtils$)
java.nio.channels.ClosedChannelException
at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:73)
at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93)
at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)

Upvotes: 9

Views: 14218

Answers (5)

EricOops

Reputation: 449

I think there are two things that can make your consumer stop working after a broker goes down in a Kafka HA cluster:

  1. --replication-factor should be bigger than 1 for your topic, so every topic partition has at least one backup.

  2. The replication factor for Kafka's internal topics should also be bigger than 1:

    offsets.topic.replication.factor = 3

    transaction.state.log.replication.factor = 3

    transaction.state.log.min.isr = 2

These two modifications kept my producer and consumer working after a broker shutdown (5 brokers, with each broker taken down once).
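
As a minimal sketch of where the broker-side settings above live (assuming the standard config/server.properties file on each broker; the values simply mirror the list above):

    # config/server.properties on each broker -- internal-topic replication
    offsets.topic.replication.factor=3
    transaction.state.log.replication.factor=3
    transaction.state.log.min.isr=2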

Upvotes: 3

Gwen Shapira

Reputation: 5158

You can see in the topic description you posted that your topic has only a single replica. With a single replica there is no fault tolerance, and if broker 0 (the broker that contains the replica) goes away, the topic will be unavailable.

Create a topic with more replicas (with --replication-factor 3) to have fault tolerance in case of crashes.
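
A sketch of creating such a topic with the kafka-topics.sh tool that ships with Kafka (the ZooKeeper address and topic name here are simply the ones from the question):

    bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic topicname

You can then re-run the describe command from the question to confirm that Replicas and Isr list more than one broker id.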

Upvotes: 2

user3881282

Reputation: 51

I ran into the same problem even when using a topic with a replication factor of 2. Setting the "metadata.max.age.ms" property on the producer worked for me (Kafka 0.8.2.1).

Otherwise, my producer was waiting 1 minute by default to fetch the new leader and start contacting it.
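
A sketch of what that looks like in the producer configuration (the property name is from this answer; the 5000 ms value is only an illustrative choice, not a recommendation):

    # producer config -- force a metadata refresh once the cached metadata is 5 s old
    metadata.max.age.ms=5000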

Upvotes: 1

Gaurav

Reputation: 41

I had a similar problem; setting the producer config "topic.metadata.refresh.interval.ms" to -1 (or whatever value is suitable for you) solved the issue for me. In my case I had 3 brokers (a multi-broker setup on my local machine) and created the topic with 3 partitions and a replication factor of 2.

Test setup:

Before the producer config:

With 3 brokers running, I killed one of the brokers after the producer started. The local ZooKeeper updated the ISR and topic metadata (removed the downed broker as leader), but the producer did not pick it up (possibly because of the default 10-minute refresh interval), so sends ended up failing and I got send exceptions.

After the producer config (-1 in my case):

With 3 brokers running, I killed one of the brokers after the producer started. The local ZooKeeper updated the ISR info (removed the downed broker as leader), the producer refreshed the new ISR/topic metadata, and message sends did not fail.

-1 makes the producer refresh the topic metadata on each failed attempt, so you may want to reduce the refresh interval to something reasonable instead.
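
As a sketch, the corresponding entry in the (old, Scala) producer configuration, with -1 meaning "refresh only on failure" as described above; any positive value is a refresh interval in milliseconds:

    # producer config -- refresh topic metadata only when a send fails
    topic.metadata.refresh.interval.ms=-1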

Upvotes: 4

user2720864

Reputation: 8161

For a topic with replication factor N, Kafka tolerates up to N-1 server failures. For example, a replication factor of 3 will allow you to handle up to 2 server failures.

Upvotes: 0
