wandermonk

Reputation: 7346

Fixing under-replicated partitions in Kafka

In our production environment, we often see partitions become under-replicated while messages are being consumed from the topics. We are using Kafka 0.11. From the documentation, what I understand is:

Configuration parameter replica.lag.max.messages was removed. Partition leaders will no longer consider the number of lagging messages when deciding which replicas are in sync.

Configuration parameter replica.lag.time.max.ms now refers not just to the time passed since last fetch request from the replica, but also to time since the replica last caught up. Replicas that are still fetching messages from leaders but did not catch up to the latest messages in replica.lag.time.max.ms will be considered out of sync.
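To make the quoted rule concrete, here is a minimal sketch of the time-based check in Python. It is a simplified model, not Kafka's actual code: `FollowerState` and `is_in_sync` are hypothetical names, and the 10000 ms value is assumed to be the broker default for `replica.lag.time.max.ms` in 0.11.

```python
from dataclasses import dataclass

# Assumed default for replica.lag.time.max.ms on a 0.11 broker.
REPLICA_LAG_TIME_MAX_MS = 10_000


@dataclass
class FollowerState:
    """Hypothetical bookkeeping for one follower replica (not a Kafka class)."""
    last_fetch_ms: int      # when the follower last sent a fetch request
    last_caught_up_ms: int  # when the follower last reached the leader's log end


def is_in_sync(follower: FollowerState, now_ms: int,
               max_lag_ms: int = REPLICA_LAG_TIME_MAX_MS) -> bool:
    # The time since the follower last caught up is what counts. A follower
    # that stopped fetching cannot catch up either, so one check covers both
    # the "stuck" and the "slow" case described in the documentation.
    return now_ms - follower.last_caught_up_ms <= max_lag_ms


now = 15_000
stuck = FollowerState(last_fetch_ms=0, last_caught_up_ms=0)
slow = FollowerState(last_fetch_ms=14_900, last_caught_up_ms=2_000)
healthy = FollowerState(last_fetch_ms=14_900, last_caught_up_ms=14_500)

for name, state in [("stuck", stuck), ("slow", slow), ("healthy", healthy)]:
    print(name, is_in_sync(state, now))
```

The `slow` follower is the interesting case: it is still fetching, but it has not reached the leader's log end within the window, so it would drop out of the ISR even on a healthy network.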

How do we fix this issue? What are the different reasons for replicas to go out of sync? In our scenario, all the Kafka brokers are in a single rack of blade servers, and all use the same network with 10 Gbps Ethernet (simplex). I do not see any reason for the replicas to go out of sync because of the network.

Upvotes: 11

Views: 52776

Answers (3)

Satish Bellapu

Reputation: 740

I faced the same issue on Kafka 2.0. After restarting the Kafka controller node, all the replicas caught up.

But I am still looking for the reason why a few partitions are under-replicated while other partitions of the same topic on the same nodes work fine. I see this issue on random partitions.
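One way to check whether the affected partitions are really random is to count which broker is missing from the ISR most often. The sketch below parses text in the shape printed by `kafka-topics.sh --describe`; the sample output, topic name, and broker IDs are made up for illustration, and `under_replicated` is a hypothetical helper.

```python
import re
from collections import Counter

# Illustrative sample only; real output comes from `kafka-topics.sh --describe`.
SAMPLE = """\
Topic:orders\tPartitionCount:3\tReplicationFactor:3\tConfigs:
\tTopic: orders\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3
\tTopic: orders\tPartition: 1\tLeader: 2\tReplicas: 2,3,1\tIsr: 2,1
\tTopic: orders\tPartition: 2\tLeader: 3\tReplicas: 3,1,2\tIsr: 3,1
"""

# Matches per-partition lines; the topic summary line does not match.
LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def under_replicated(describe_output):
    """Return (topic, partition, leader, missing_brokers) for each partition
    whose ISR is smaller than its replica list."""
    rows = []
    for line in describe_output.splitlines():
        m = LINE.search(line)
        if not m:
            continue
        replicas = [int(x) for x in m["replicas"].split(",")]
        isr = {int(x) for x in m["isr"].split(",") if x}
        missing = [r for r in replicas if r not in isr]
        if missing:
            rows.append((m["topic"], int(m["partition"]), int(m["leader"]), missing))
    return rows


rows = under_replicated(SAMPLE)
lagging_by_broker = Counter(b for _, _, _, missing in rows for b in missing)

for row in rows:
    print(row)
print(lagging_by_broker)
```

If one broker ID dominates the counter, the problem is probably tied to that broker rather than to random partitions.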

Upvotes: 1

kivagant

Reputation: 1957

Do NOT run reassignment for all topics at once; consider running it in small batches.

  1. Find a topic that has under-replicated partitions and for which the reassignment process cannot complete.
  2. Set unclean.leader.election.enable to true for this topic.
  3. Find the under-replicated partition that is stuck for this topic. Check its leader ID.
  4. Stop the broker (just the service, not the instance).
  5. Execute a Preferred Replica Election (in yahoo/kafka-manager or manually).
  6. Start the broker again.

Repeat for the rest of the topics that have the same problem.

I also tried this advice, but it didn't help me: https://stackoverflow.com/a/51063607/1929406
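For the manual route in step 5, the election can be limited to a few partitions at a time by passing a JSON file to `kafka-preferred-replica-election.sh --path-to-json-file`. The sketch below only builds those JSON documents in small batches; `election_batches` is a hypothetical helper, the topic names are made up, and the tool name applies to older releases such as 0.11 (newer releases ship a different election tool).

```python
import json


def election_batches(partitions, batch_size=5):
    """Yield JSON documents in the {"partitions": [...]} shape read by the
    preferred replica election tool, batch_size partitions at a time."""
    for start in range(0, len(partitions), batch_size):
        chunk = partitions[start:start + batch_size]
        payload = {
            "partitions": [{"topic": t, "partition": p} for t, p in chunk]
        }
        yield json.dumps(payload, indent=2)


# Made-up (topic, partition) pairs standing in for the stuck partitions.
stuck = [("orders", 1), ("orders", 2), ("payments", 0)]
batches = list(election_batches(stuck, batch_size=2))

for i, doc in enumerate(batches):
    print(f"--- batch {i} ---")
    print(doc)
```

Each document would be written to its own file and run as a separate election, waiting for the ISR to recover before moving on to the next batch.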

Upvotes: 0

Doron Levi

Reputation: 468

We faced the same issue.

The solution was:

  1. Restart the ZooKeeper leader.
  2. Restart the broker or brokers that are not replicating some of the partitions.

No data loss.

The issue is due to a faulty state in ZooKeeper. There was an open issue on ZooKeeper for this, but I don't remember the number.

Upvotes: 16
