Spring-Kafka: Impact of Consumer Group Rebalancing on Stateful Retry

Question

When using SeekToCurrentErrorHandler with stateful retry, so that the message is re-polled from the broker for each retry, a long retry period carries the risk that a consumer group rebalance re-assigns the partition to another consumer. The stateful retry period/attempts would then be reset, as the new consumer has no knowledge of the retry state.

For example, if the maximum retry period were 24 hours but consumer group rebalancing happened on average every 12 hours, the retry could never complete, and the message (and those behind it) would eventually expire from the topic once they exceeded the retention period (assuming the cause of the retryable exception was not resolved in that time). The message would not end up on the DLT after 24 hours as expected, because the reset means the retries are never exhausted.
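To make the arithmetic concrete, here is a minimal plain-Java sketch of that scenario. It does not use Spring or Kafka at all; the class and method names are hypothetical, and it assumes the worst case in which every rebalance moves the partition to a consumer with no retry state. The 168-hour horizon stands in for an assumed 7-day topic retention.

```java
class RetryResetSimulation {

    /**
     * Returns the hour at which the retry window is exhausted (the point at
     * which the record would go to the DLT), or -1 if that never happens
     * within the horizon. Assumption: each rebalance resets the retry state.
     */
    static long hoursUntilExhausted(long retryWindowHours,
                                    long rebalanceEveryHours,
                                    long horizonHours) {
        long elapsedInWindow = 0;
        for (long hour = 1; hour <= horizonHours; hour++) {
            elapsedInWindow++;
            if (elapsedInWindow >= retryWindowHours) {
                return hour; // retries exhausted
            }
            if (rebalanceEveryHours > 0 && hour % rebalanceEveryHours == 0) {
                elapsedInWindow = 0; // new consumer starts retrying from scratch
            }
        }
        return -1; // record expires from the topic before retries are exhausted
    }

    public static void main(String[] args) {
        // 24h retry window, rebalance every 12h, 7-day horizon
        System.out.println(hoursUntilExhausted(24, 12, 168)); // prints -1
        // Same window with no rebalances at all
        System.out.println(hoursUntilExhausted(24, 0, 168));  // prints 24
    }
}
```

Under these assumptions the retry window only completes when rebalances are less frequent than the window itself.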

I assume that even if a consumer is retrying by re-polling messages, there is no guarantee that this consumer would retain assignment to the partition following a rebalance. Or can we be confident that, so long as this consumer instance is alive, it would typically retain assignment to the partition it is polling?

Are there best practices/guidelines on the use of stateful retry to cater for this?

With stateless retry, any total retry time that exceeds the poll timeout would cause a rebalance and duplicate message delivery. To avoid that, the retry period must be very limited. Or is the guideline to allow this and ensure messages are deduplicated by the consumer, so that duplicate messages are acceptable and long-running stateless retries can be configured?

Is the only safe and stable option for enabling a retry period of something like several hours (e.g. to cater for a service being unavailable for this period) to use retry topics?

Thanks, Rob.

Gary Russell · Accepted Answer

The whole point of stateful retry was to avoid a rebalance; without it, the consumer would be delayed up to the aggregate of all retry attempt delays.

However, retry in the listener adapter (including stateful retry) has now been deprecated because the error handler can now do everything the RetryTemplate can do (back off, exception classification, etc, etc).
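As a rough illustration of what that looks like, the fragment below configures back off, exception classification and dead-letter publishing on the error handler instead of a RetryTemplate. It is a sketch, not the configuration from the question: it assumes spring-kafka 2.8 or later (where DefaultErrorHandler supersedes SeekToCurrentErrorHandler), the class name, retry count and intervals are made up, and the handler still has to be set on the listener container factory (for example with setCommonErrorHandler), which is not shown.

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaOperations;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.kafka.support.ExponentialBackOffWithMaxRetries;

@Configuration
class RetryErrorHandlerConfig { // hypothetical class name

    @Bean
    DefaultErrorHandler errorHandler(KafkaOperations<Object, Object> template) {
        // Illustrative values only: 6 retries, 1s initial delay, doubling, capped at 30s.
        ExponentialBackOffWithMaxRetries backOff = new ExponentialBackOffWithMaxRetries(6);
        backOff.setInitialInterval(1_000L);
        backOff.setMultiplier(2.0);
        backOff.setMaxInterval(30_000L);

        // Once retries are exhausted, the failed record is published to a dead-letter topic.
        DefaultErrorHandler handler =
                new DefaultErrorHandler(new DeadLetterPublishingRecoverer(template), backOff);

        // Exception classification: skip retries for failures that cannot succeed on retry.
        handler.addNotRetryableExceptions(IllegalArgumentException.class);
        return handler;
    }
}
```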

With stateful retry (or backoffs in the error handler), the longest back off must be less than max.poll.interval.ms.
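A quick way to sanity-check a planned back off against that rule is to list the delays it would produce and compare the longest one with max.poll.interval.ms. The sketch below is plain Java with hypothetical names; it only approximates an exponential back off with a cap and does not call any Spring or Kafka API. The 300000 ms figure is Kafka's default max.poll.interval.ms (5 minutes); substitute whatever value the consumer is actually configured with.

```java
import java.util.ArrayList;
import java.util.List;

class BackOffCheck {

    /** Delays (ms) of a capped exponential back off: initial, initial*m, initial*m^2, ... */
    static List<Long> exponentialIntervals(long initialMs, double multiplier,
                                           long maxIntervalMs, int maxRetries) {
        List<Long> intervals = new ArrayList<>();
        double current = initialMs;
        for (int i = 0; i < maxRetries; i++) {
            intervals.add(Math.min((long) current, maxIntervalMs));
            current = Math.min(current * multiplier, maxIntervalMs);
        }
        return intervals;
    }

    /** True only if every single delay is strictly shorter than max.poll.interval.ms. */
    static boolean fitsWithinPollInterval(List<Long> intervals, long maxPollIntervalMs) {
        for (long interval : intervals) {
            if (interval >= maxPollIntervalMs) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long maxPollIntervalMs = 300_000L; // Kafka default, assumed unchanged here

        List<Long> capped = exponentialIntervals(1_000L, 2.0, 60_000L, 8);
        System.out.println(fitsWithinPollInterval(capped, maxPollIntervalMs));   // prints true

        List<Long> tooLong = exponentialIntervals(1_000L, 2.0, 600_000L, 12);
        System.out.println(fitsWithinPollInterval(tooLong, maxPollIntervalMs));  // prints false
    }
}
```

The check is deliberately conservative about nothing else: time spent in the listener itself also counts towards the poll interval, so in practice the longest delay should leave some headroom.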

A 24 hour backoff is, frankly, ridiculous - it would be better to just stop the container and restart it a day later.
