jessica
jessica

Reputation: 2590

Kafka rolling restart – what is the right approach to perform Kafka rolling restart on production clusters

we are using Kafka version - 2.7.1. cluster includes 5 Kafka machines on Linux RHEL 7.6 version

we want to perform Kafka restart on all 5 brokers, but since the Kafka cluster is production cluster, then rolling restart should be the right way.

so we wrote bash script that do the following

  1. stop the kafka01 broker (by systemctl command)
  2. verify no Kafka PID (process)
  3. start kafka01 broker
  4. verify kafka01 is listening to Kafka port 6667

and continue the same steps 1-4, on kafka02, kafka03, kafka04, kafka05 additionally to above steps – 1-4.

we want to add the verification about – verify no corrupted indexes appears after starting the Kafka broker (appears on the Kafka log - server.log), before continue to the next Kafka broker

but we are not sure if this additionally step is needed

NOTE - after Kafka broker restart - usually in the server.log we can see that Kafka tries to fix corrupted indexes (so its can take about 10min to 1 hour)

Upvotes: 3

Views: 5120

Answers (2)

Vineeth
Vineeth

Reputation: 1032

A better metrics or value to check before continuing with the rolling restart on production is to look out for this metric : kafka_controller_kafkacontroller_preferredreplicaimbalancecount

We are using the modern Kafka with KRAFT and this metric is grabbed from the controller (equivalent to Zookeeper)

This metric gives the value based on whether the Kafka cluster is in a "balance" state. When the value is zero, it means it is balanced.

We are inside Kubernetes with 3 brokers and this is what we have in place for "StartupProbe" :

startupProbe:
exec:
  command:
    - sh
    - '-c'
    - >
      [[ "$(kafka-broker-api-versions --bootstrap-server
      localhost:9092 | grep 9092 | wc -l)" == "3" ]] || { echo >&2
      "Broker count is not 3"; exit 1; } && [[ "$(curl -s
      kafka-controller-headless.my-namespace:9102/metrics
      | grep
      kafka_controller_kafkacontroller_preferredreplicaimbalancecount
      | tail -n 1 | grep -o '[^ ]\+$')" == "0.0" ]] || { echo >&2
      "Replica Imbalance not 0"; exit 1; }

Upvotes: 0

Ran Lupovich
Ran Lupovich

Reputation: 1821

kafka-topics.sh --bootstrap-server kafka01:9092 --describe --under-replicated-partitions | wc -l

After you are restarting a broker make sure there is no under replicated partition before moving on to another broker,

For having no outage with one broker down in kafka cluster you need to make sure that all your topics are using replication.factor=3 and min.isr=2

In order not to passing the controller around (twice) you should check which broker is the kafka controller and restart it last.

zookeeper-shell.sh [ZK_IP] get /controller

You should restart it last to avoid passing the controller around , when restarting this broker the controller role would be assigned to another running broker so restarting it last would insure you are passing the controller only once

Upvotes: 7

Related Questions