nish

Reputation: 7280

Kafka replication failing on one node with NotLeaderForPartitionException

I'm using kafka_2.11-0.10.2.1 with a 5-broker setup. One of the nodes is continuously throwing the following errors:

[2020-05-25 16:45:02,054] ERROR [ReplicaFetcherThread-0-1], Error for partition [atlas10-prod-serverSide-ssch,5] to broker 1:org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition. (kafka.server.ReplicaFetcherThread)
[2020-05-25 16:45:02,055] ERROR [ReplicaFetcherThread-0-1], Error for partition [atlas10-prod-serverSide-ssch,10] to broker 1:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2020-05-25 16:44:14,753] ERROR [ReplicaFetcherThread-0-1], Error for partition [atlas10-prod-serverSide-size-chart,0] to broker 1:org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition. (kafka.server.ReplicaFetcherThread)
[2020-05-25 16:44:14,754] ERROR [ReplicaFetcherThread-0-1], Error for partition [atlas10-prod-serverSide-size-chart,5] to broker 1:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2020-05-25 16:44:14,754] ERROR [ReplicaFetcherThread-0-1], Error for partition [atlas10-prod-serverSide-size-chart,12] to broker 1:org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition. (kafka.server.ReplicaFetcherThread)

I noticed that this is happening for replicas on broker 3, for partitions whose leader is broker 1, but I'm not sure why. Here is the description of the affected topics:

Topic:atlas10-prod-serverSide-size-chart    PartitionCount:15   ReplicationFactor:3 Configs:
    Topic: atlas10-prod-serverSide-size-chart   Partition: 0    Leader: 1   Replicas: 1,2,3 Isr: 1,2
    Topic: atlas10-prod-serverSide-size-chart   Partition: 1    Leader: 2   Replicas: 2,3,4 Isr: 2,4,3
    Topic: atlas10-prod-serverSide-size-chart   Partition: 2    Leader: 3   Replicas: 3,4,0 Isr: 0,4,3
    Topic: atlas10-prod-serverSide-size-chart   Partition: 3    Leader: 4   Replicas: 4,0,1 Isr: 0,1,4
    Topic: atlas10-prod-serverSide-size-chart   Partition: 4    Leader: 0   Replicas: 0,1,2 Isr: 0,1,2
    Topic: atlas10-prod-serverSide-size-chart   Partition: 5    Leader: 1   Replicas: 1,3,4 Isr: 1,4
    Topic: atlas10-prod-serverSide-size-chart   Partition: 6    Leader: 2   Replicas: 2,4,0 Isr: 0,2,4
    Topic: atlas10-prod-serverSide-size-chart   Partition: 7    Leader: 3   Replicas: 3,0,1 Isr: 0,1,3
    Topic: atlas10-prod-serverSide-size-chart   Partition: 8    Leader: 4   Replicas: 4,1,2 Isr: 1,2,4
    Topic: atlas10-prod-serverSide-size-chart   Partition: 9    Leader: 0   Replicas: 0,2,3 Isr: 0,2,3
    Topic: atlas10-prod-serverSide-size-chart   Partition: 10   Leader: 1   Replicas: 1,4,0 Isr: 0,1,4
    Topic: atlas10-prod-serverSide-size-chart   Partition: 11   Leader: 2   Replicas: 2,0,1 Isr: 0,1,2
    Topic: atlas10-prod-serverSide-size-chart   Partition: 12   Leader: 1   Replicas: 3,1,2 Isr: 1,2
    Topic: atlas10-prod-serverSide-size-chart   Partition: 13   Leader: 4   Replicas: 4,2,3 Isr: 2,4,3
    Topic: atlas10-prod-serverSide-size-chart   Partition: 14   Leader: 0   Replicas: 0,3,4 Isr: 0,4,3

Topic:atlas10-prod-serverSide-ssch  PartitionCount:15   ReplicationFactor:3 Configs:
    Topic: atlas10-prod-serverSide-ssch Partition: 0    Leader: 1   Replicas: 1,0,2 Isr: 0,1,2
    Topic: atlas10-prod-serverSide-ssch Partition: 1    Leader: 2   Replicas: 2,1,3 Isr: 1,2,3
    Topic: atlas10-prod-serverSide-ssch Partition: 2    Leader: 3   Replicas: 3,2,4 Isr: 2,4,3
    Topic: atlas10-prod-serverSide-ssch Partition: 3    Leader: 4   Replicas: 4,3,0 Isr: 0,4,3
    Topic: atlas10-prod-serverSide-ssch Partition: 4    Leader: 0   Replicas: 0,4,1 Isr: 0,1,4
    Topic: atlas10-prod-serverSide-ssch Partition: 5    Leader: 1   Replicas: 1,2,3 Isr: 1,2
    Topic: atlas10-prod-serverSide-ssch Partition: 6    Leader: 2   Replicas: 2,3,4 Isr: 2,4,3
    Topic: atlas10-prod-serverSide-ssch Partition: 7    Leader: 3   Replicas: 3,4,0 Isr: 0,4,3
    Topic: atlas10-prod-serverSide-ssch Partition: 8    Leader: 4   Replicas: 4,0,1 Isr: 0,1,4
    Topic: atlas10-prod-serverSide-ssch Partition: 9    Leader: 0   Replicas: 0,1,2 Isr: 0,1,2
    Topic: atlas10-prod-serverSide-ssch Partition: 10   Leader: 1   Replicas: 1,3,4 Isr: 1,4
    Topic: atlas10-prod-serverSide-ssch Partition: 11   Leader: 2   Replicas: 2,4,0 Isr: 0,2,4
    Topic: atlas10-prod-serverSide-ssch Partition: 12   Leader: 3   Replicas: 3,0,1 Isr: 0,1,3
    Topic: atlas10-prod-serverSide-ssch Partition: 13   Leader: 4   Replicas: 4,1,2 Isr: 1,2,4
    Topic: atlas10-prod-serverSide-ssch Partition: 14   Leader: 0   Replicas: 0,2,3 Isr: 0,2,3
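
One way to cross-check the pattern described above is to scan the describe output for replicas that are assigned to a partition but missing from its ISR. The following is only a minimal Python sketch, assuming the output format shown above; `out_of_sync`, `ROW` and `SAMPLE` are hypothetical names introduced for illustration, and `SAMPLE` is just a few rows copied from the ssch topic.

```python
import re

# Matches one partition row of the describe output shown above.
# The "Topic:... PartitionCount:..." header line does not match and is skipped.
ROW = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def out_of_sync(describe_output):
    """Return (topic, partition, leader, missing_brokers) for every
    partition that has at least one assigned replica outside the ISR."""
    rows = []
    for line in describe_output.splitlines():
        m = ROW.search(line)
        if m is None:
            continue
        replicas = [int(b) for b in m.group("replicas").split(",") if b]
        isr = {int(b) for b in m.group("isr").split(",") if b}
        missing = [b for b in replicas if b not in isr]
        if missing:
            rows.append(
                (m.group("topic"), int(m.group("partition")),
                 int(m.group("leader")), missing)
            )
    return rows


# A few rows copied from the ssch topic description above.
SAMPLE = """\
Topic:atlas10-prod-serverSide-ssch  PartitionCount:15   ReplicationFactor:3 Configs:
    Topic: atlas10-prod-serverSide-ssch Partition: 4    Leader: 0   Replicas: 0,4,1 Isr: 0,1,4
    Topic: atlas10-prod-serverSide-ssch Partition: 5    Leader: 1   Replicas: 1,2,3 Isr: 1,2
    Topic: atlas10-prod-serverSide-ssch Partition: 10   Leader: 1   Replicas: 1,3,4 Isr: 1,4
"""

for topic, partition, leader, missing in out_of_sync(SAMPLE):
    print(f"{topic}-{partition}: leader={leader}, not in ISR={missing}")
```

Applied to the two full topic descriptions above, this flags size-chart partitions 0, 5 and 12 and ssch partitions 5 and 10. In every case the leader is broker 1 and the replica missing from the ISR is broker 3.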

EDIT - Tried restarting the broker multiple times, but could not recover. Tried restarting the controller also, but it did not help either.

Upvotes: 0

Views: 1287

Answers (1)

Mickael Maison

Reputation: 26930

Kafka 0.10.2.1 is very old and has known issues with replication. The state your cluster ended up in is relatively common with that version, but it's usually pretty easy to recover from.

In most cases, you can just restart the broker that is having issues. Upon restarting, it should get the up-to-date metadata from the controller and start working again. If it doesn't, restarting the controller usually does the trick.

You should really consider upgrading to a newer release. The replication protocol has improved a lot and is now more stable.

Upvotes: 1
