Kafka rolling restart – what is the right approach to perform Kafka rolling restart on production clusters

Question

we are using Kafka version - 2.7.1. cluster includes 5 Kafka machines on Linux RHEL 7.6 version

we want to perform Kafka restart on all 5 brokers, but since the Kafka cluster is production cluster, then rolling restart should be the right way.

so we wrote bash script that do the following

stop the kafka01 broker (by systemctl command)
verify no Kafka PID (process)
start kafka01 broker
verify kafka01 is listening to Kafka port 6667

and continue the same steps 1-4, on kafka02, kafka03, kafka04, kafka05 additionally to above steps – 1-4.

we want to add the verification about – verify no corrupted indexes appears after starting the Kafka broker (appears on the Kafka log - server.log), before continue to the next Kafka broker

but we are not sure if this additionally step is needed

NOTE - after Kafka broker restart - usually in the server.log we can see that Kafka tries to fix corrupted indexes (so its can take about 10min to 1 hour)

Ran Lupovich · Accepted Answer

kafka-topics.sh --bootstrap-server kafka01:9092 --describe --under-replicated-partitions | wc -l

After you are restarting a broker make sure there is no under replicated partition before moving on to another broker,

For having no outage with one broker down in kafka cluster you need to make sure that all your topics are using replication.factor=3 and min.isr=2

In order not to passing the controller around (twice) you should check which broker is the kafka controller and restart it last.

zookeeper-shell.sh [ZK_IP] get /controller

You should restart it last to avoid passing the controller around , when restarting this broker the controller role would be assigned to another running broker so restarting it last would insure you are passing the controller only once

Kafka rolling restart – what is the right approach to perform Kafka rolling restart on production clusters

Answers (2)

Related Questions