Rahul Vel

Reputation: 35

Kafka Producer is not retrying after Timeout

Intermittently (once or twice a month), I see the following error in my logs: org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for cart-topic-0: 5109 ms has passed since batch creation plus linger time. As a result, the corresponding message was not sent by the Kafka producer.

All the brokers are up and available, so I'm not sure why this error occurs. The load is not high during this period either.

I have set the retries property to 10 in the producer config, but the message was still not retried. Is there anything else I need to add for the Kafka send method? I have gone through similar issues that were raised, but none of them reached a proper conclusion about this error.

Can someone please help me fix this?

Upvotes: 2

Views: 1324

Answers (2)

Rohitashwa Nigam

Reputation: 408

From the KIP proposal, which has since been implemented:

We propose adding a new timeout delivery.timeout.ms. The window of enforcement includes batching in the accumulator, retries, and the inflight segments of the batch. With this config, the user has a guaranteed upper bound on when a record will either get sent, fail or expire from the point when send returns. In other words we no longer overload request.timeout.ms to act as a weak proxy for accumulator timeout and instead introduce an explicit timeout that users can rely on without exposing any internals of the producer such as the accumulator.

So basically, since that change you can configure a delivery timeout (delivery.timeout.ms) in addition to retries, and it applies to every async send you execute.
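As a rough illustration of how the delivery timeout sits alongside the other settings, here is a minimal sketch that builds the producer properties as plain string keys, so it compiles without the Kafka client on the classpath. The class name ProducerTimeoutConfig, the build helper, and the numeric values are hypothetical and only for illustration. The one rule it checks is that delivery.timeout.ms should be at least linger.ms + request.timeout.ms, which the producer also enforces at construction time.

```java
import java.util.Properties;

// Hypothetical helper, not part of Kafka: assembles timeout-related producer settings.
public class ProducerTimeoutConfig {

    // Returns producer properties, rejecting combinations the producer would refuse.
    public static Properties build(int lingerMs, int requestTimeoutMs,
                                   int deliveryTimeoutMs, int retries) {
        // delivery.timeout.ms bounds the whole send (batching + in-flight + retries),
        // so it must not be shorter than linger.ms + request.timeout.ms.
        long minimum = (long) lingerMs + requestTimeoutMs;
        if (deliveryTimeoutMs < minimum) {
            throw new IllegalArgumentException(
                "delivery.timeout.ms (" + deliveryTimeoutMs + ") must be >= "
                + "linger.ms + request.timeout.ms (" + minimum + ")");
        }
        Properties props = new Properties();
        props.setProperty("linger.ms", Integer.toString(lingerMs));
        props.setProperty("request.timeout.ms", Integer.toString(requestTimeoutMs));
        props.setProperty("delivery.timeout.ms", Integer.toString(deliveryTimeoutMs));
        props.setProperty("retries", Integer.toString(retries));
        return props;
    }

    public static void main(String[] args) {
        // Illustrative values only; tune them for your own cluster and latency needs.
        Properties props = build(5, 30_000, 120_000, 10);
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

These properties would be merged with the usual bootstrap.servers and serializer settings before being passed to the KafkaProducer constructor. Retries are only attempted while the delivery timeout has not yet elapsed, so a large retries value with a short delivery.timeout.ms can still expire a record.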

Upvotes: 1

Shawn

Reputation: 288

I had an issue where retries were not being obeyed, but in my particular case it was because we were calling get() on the result of send() for synchronous behaviour. We hadn't realized it would impact retries.

While investigating the issue along various paths, I came across the definition of the kinds of errors that are retriable:

https://kafka.apache.org/11/javadoc/org/apache/kafka/common/errors/RetriableException.html

What had confused me is that timeout was listed as a retriable one.

I would normally have suggested checking whether delivery of your batches is taking too long and messages in your buffer are expiring due to increased volume, but you've mentioned that the volume isn't particularly high.

Did you determine whether increasing request.timeout.ms has an impact on the frequency of occurrence? It might be more a matter of treating the symptom than the cause.

Upvotes: 0
