user_1357

Reputation: 7940

GKE Upgrade Causing Kafka to become Unavailable

I have a Kafka cluster hosted in GKE. Google updates the GKE nodes on a weekly basis, and whenever this happens Kafka becomes temporarily unavailable, which causes a massive wave of errors and rebalances before it gets back to a healthy state. Currently we rely on Kubernetes retries to eventually succeed once the upgrade completes and the cluster becomes available again. Is there a way to gracefully handle this type of situation in Kafka, or to avoid it entirely if possible?

Upvotes: 0

Views: 192

Answers (2)

Murugesa Pandian

Reputation: 11

You have control over GKE node auto-upgrades through a maintenance window, which lets you decide when upgrades are allowed to occur. Depending on your business criticality, you can configure this window alongside the Kubernetes retry behaviour you already rely on.
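For example, a recurring weekly window can be set with gcloud roughly like this (a sketch only; the cluster name and times are placeholders, and the recurrence string is an RFC 5545 RRULE):

    gcloud container clusters update my-cluster \
        --maintenance-window-start=2023-01-07T01:00:00Z \
        --maintenance-window-end=2023-01-07T05:00:00Z \
        --maintenance-window-recurrence="FREQ=WEEKLY;BYDAY=SA"

This does not stop the upgrades; it only constrains them to a time slot where the disruption hurts the least.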

Upvotes: 0

Fares

Reputation: 650

To give you better advice, we would need a little more information about your setup: which versions of Kubernetes and Kafka? How many Kafka and ZooKeeper pods? How are you deploying your Kafka cluster (a simple Helm chart or an operator)? What exact symptoms do you see when the cluster is upgraded, what errors do you get, what is the state of the Kafka cluster, and how do you monitor it?

But here are some points worth investigating.

  • Are you spreading the Kafka/ZooKeeper pods correctly across nodes/zones?
  • Do you set PodDisruptionBudgets with a reasonable maxUnavailable? (See the sketch after this list.)
  • What are the readiness/liveness probes for your Kafka/ZooKeeper pods?
  • Are your topics correctly replicated?
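To make the first three bullets concrete, here is a rough sketch, assuming the broker pods carry an app: kafka label and listen on port 9092 (both of these, and the object names, depend on your chart/operator and are only assumptions):

    # PodDisruptionBudget: allow at most one Kafka pod to be evicted at a time,
    # so a node drain during the GKE upgrade never takes down two brokers at once.
    apiVersion: policy/v1          # policy/v1beta1 on older clusters
    kind: PodDisruptionBudget
    metadata:
      name: kafka-pdb
    spec:
      maxUnavailable: 1
      selector:
        matchLabels:
          app: kafka

    # Fragments for the Kafka pod template (e.g. in the StatefulSet):
    # spread brokers across nodes, and only mark a broker ready when its
    # listener port actually accepts connections.
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: kafka
            topologyKey: kubernetes.io/hostname
    readinessProbe:
      tcpSocket:
        port: 9092
      initialDelaySeconds: 30
      periodSeconds: 10

A similar budget and spread should exist for the ZooKeeper pods, and topics should be created with a replication factor (and min.insync.replicas) that tolerates one broker being down.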

I would strongly encourage you to take a look at https://strimzi.io/, which can be very helpful if you want to operate Kafka on Kubernetes. It is an open-source operator and very well documented.
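For reference, a minimal Strimzi Kafka resource looks roughly like the sketch below (the name, sizes and replication settings are placeholders; check the Strimzi documentation for the apiVersion matching your operator version). The operator then handles rolling restarts and disruption budgets for the brokers itself, which is exactly the part that hurts during node upgrades.

    apiVersion: kafka.strimzi.io/v1beta2
    kind: Kafka
    metadata:
      name: my-cluster
    spec:
      kafka:
        replicas: 3                      # one broker per node/zone
        listeners:
          - name: plain
            port: 9092
            type: internal
            tls: false
        config:
          default.replication.factor: 3  # survive the loss of one broker
          min.insync.replicas: 2
        storage:
          type: persistent-claim
          size: 100Gi
      zookeeper:
        replicas: 3
        storage:
          type: persistent-claim
          size: 10Gi
      entityOperator:
        topicOperator: {}
        userOperator: {}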

Upvotes: 2
