Karim Tawfik

Reputation: 1486

Kafka streams 1.0: processing timeout with high max.poll.interval.ms and session.timeout.ms

I am running a stateless processor on Kafka Streams 1.0 against a Kafka broker 1.0.1.

The problem is that the CustomProcessor gets closed every few seconds, which triggers a rebalance. I am using the following configs:

session.timeout.ms=15000

heartbeat.interval.ms=3000 // kept at no more than 1/3 of session.timeout.ms

max.poll.interval.ms=Integer.MAX_VALUE // set this large because I am doing computationally intensive operations (NLP) that might take up to 10 minutes to process one Kafka message

max.poll.records=1

Despite this configuration and my understanding of how the Kafka timeout configurations work, I see the consumer rebalancing every few seconds.

I have already gone through the article and the other Stack Overflow questions listed below on how to tune for long-running operations and avoid a very long session timeout, which would delay failure detection. However, I still see unexpected behavior, unless I misunderstand something.
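For reference, here is a minimal sketch of how the settings above could be expressed as a java.util.Properties object using plain string keys. The class name StreamsTimeoutConfig and the application.id and bootstrap.servers values are hypothetical placeholders, not taken from my actual setup.

```java
import java.util.Properties;

final class StreamsTimeoutConfig {

    // Collects the timeout-related settings from the question as string key/value pairs.
    static Properties build() {
        Properties props = new Properties();
        // Hypothetical placeholders, needed by any Streams app but not part of the question.
        props.setProperty("application.id", "nlp-processor");
        props.setProperty("bootstrap.servers", "localhost:9092");
        // Values exactly as listed in the question.
        props.setProperty("session.timeout.ms", "15000");
        props.setProperty("heartbeat.interval.ms", "3000");
        props.setProperty("max.poll.interval.ms", String.valueOf(Integer.MAX_VALUE));
        props.setProperty("max.poll.records", "1");
        return props;
    }

    public static void main(String[] args) {
        // Print the settings so they can be compared against the client's startup log.
        build().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In a real application, an object like this would typically be handed to the KafkaStreams constructor together with the topology.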

KIP-62

Diff between session.timeout.ms and max.poll.interval

Kafka kstreams processing timeout

For the consumer environment, I have 8 machines with 16 cores each, consuming from 1 topic with 100 partitions. I am following the practices that this Confluent doc recommends.

Any pointers?

Upvotes: 1

Views: 1491

Answers (1)

Karim Tawfik

Reputation: 1486

I figured it out. After a lot of debugging and enabling verbose logging for both the Kafka Streams client and the broker, it came down to 2 things:

  1. There is a critical bug in Streams 1.0.0 (HERE), so I upgraded my client version from 1.0.0 to 1.0.1.
  2. I changed the value of the Streams property default.deserialization.exception.handler from org.apache.kafka.streams.errors.LogAndFailExceptionHandler to org.apache.kafka.streams.errors.LogAndContinueExceptionHandler.
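The second change can be sketched roughly as follows, again with plain string keys on a java.util.Properties object. The class name DeserializationHandlerConfig is hypothetical; the property key and handler class name are the ones mentioned above.

```java
import java.util.Properties;

final class DeserializationHandlerConfig {

    static final String HANDLER_KEY = "default.deserialization.exception.handler";

    // Returns a copy of the given settings with the handler switched to log-and-continue.
    static Properties withLogAndContinue(Properties base) {
        Properties props = new Properties();
        props.putAll(base);
        // With this handler, records that fail to deserialize are logged and skipped
        // instead of failing the stream thread.
        props.setProperty(HANDLER_KEY,
                "org.apache.kafka.streams.errors.LogAndContinueExceptionHandler");
        return props;
    }

    public static void main(String[] args) {
        Properties props = withLogAndContinue(new Properties());
        System.out.println(HANDLER_KEY + "=" + props.getProperty(HANDLER_KEY));
    }
}
```

Keep in mind that with this handler, bad records are dropped silently apart from the log entry, so it is worth watching the logs for skipped messages.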

After these 2 changes, everything ran perfectly with no restarts. I am using Grafana to monitor restarts, and over the past 48 hours there has not been a single one.

I might do more troubleshooting to determine which of the 2 items above is the real fix, but I am in a hurry to deploy to production. If anybody is interested in picking it up from there, go ahead; otherwise, once I have time I will do the further analysis and update the answer!

So happy to get this fixed!!!

Upvotes: 2
