Drex

Reputation: 3851

Should the consumer or the client produce the retry event?

Let's say we have a Kafka consumer polling from a normal topic that is heavily loaded, and for each event it makes a client call to a service. The duration of the client call varies, sometimes fast, sometimes slow. We have a retry topic, so whenever the client call has an issue, we produce a retry event.

Here is an interesting design question: which component should be responsible for producing the retry event?

  1. If we let the consumer handle the retry produce, the consumer has to wait for the client call to finish, which risks consumer lag because event processing becomes slow (a minimal sketch of this option follows the list)
  2. If we let the service handle the retry produce, that solves the consumer lag issue, since the consumer just sends and forgets. However, if the service tries to produce a retry event and that produce fails, the retry record might be lost forever for the current client call
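
For concreteness, here is a minimal sketch of option 1 using the plain Java clients. The topic names (orders, orders-retry) and callService() are hypothetical stand-ins, just to show where the blocking happens:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ConsumerSideRetry {

        // Hypothetical stand-in for the downstream client call; may be fast or slow.
        static void callService(String payload) throws Exception {
        }

        public static void main(String[] args) {
            Properties cProps = new Properties();
            cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processor");
            cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            cProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

            Properties pProps = new Properties();
            pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
                consumer.subscribe(List.of("orders"));
                while (true) {
                    for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                        try {
                            callService(rec.value()); // blocking: this is what slows the poll loop
                        } catch (Exception e) {
                            // the consumer's domain produces the retry event;
                            // note: ignoring send()'s Future means a failed retry produce is silent
                            producer.send(new ProducerRecord<>("orders-retry", rec.key(), rec.value()));
                        }
                    }
                    consumer.commitSync(); // commit only after the whole batch is handled
                }
            }
        }
    }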

I have also considered an additional DB for persisting retry events, but that raises the same concern: if the DB write fails, we could lose the retry, just as we would if the Kafka produce errored out.

The goal is to make this resilient, so that every failed event gets a chance to be retried, while at the same time avoiding the consumer lag issue.

Upvotes: 1

Views: 767

Answers (2)

Gary Russell

Reputation: 174779

With Spring for Apache Kafka, the DeadLetterPublishingRecoverer (which can be used to publish to your "retry" topic) has a property failIfSendResultIsError.

When this is true (default), the recovery operation fails and the DefaultErrorHandler will detect the failure and re-seek the failed consumer record so that it will continue to be retried.
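
A minimal sketch of that wiring, assuming a KafkaTemplate bean is available and using a hypothetical retry topic name:

    import org.apache.kafka.common.TopicPartition;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.kafka.core.KafkaTemplate;
    import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
    import org.springframework.kafka.listener.DefaultErrorHandler;
    import org.springframework.util.backoff.FixedBackOff;

    @Configuration
    public class RetryPublishingConfig {

        @Bean
        public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
            DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template,
                    (record, ex) -> new TopicPartition("my-retry-topic", record.partition())); // hypothetical topic
            recoverer.setFailIfSendResultIsError(true); // the default, shown for emphasis
            // Retry in place twice, one second apart, before publishing to the retry topic.
            return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 2L));
        }
    }

This addresses the second concern in the question: a failed retry produce is detected, and the original record is redelivered rather than lost.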

The non-blocking retry mechanism uses this recoverer internally so the same behavior will occur there too.

https://docs.spring.io/spring-kafka/docs/current/reference/html/#retry-topic
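
A sketch of the non-blocking retry approach from that link, with a hypothetical topic and downstream call:

    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.kafka.annotation.RetryableTopic;
    import org.springframework.retry.annotation.Backoff;
    import org.springframework.stereotype.Component;

    @Component
    public class OrderListener {

        // Failed records are published to auto-created retry topics and consumed again
        // after the back-off, so the main consumer is not blocked while retries wait.
        @RetryableTopic(attempts = "4", backoff = @Backoff(delay = 1000, multiplier = 2.0))
        @KafkaListener(topics = "orders") // hypothetical main topic
        public void listen(String event) {
            callService(event); // hypothetical downstream call; throwing triggers a retry
        }

        private void callService(String event) {
        }
    }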

Upvotes: 1

Jessica Vasey

Reputation: 392

I'm not sure I completely understand the question, but I will give it a shot. To summarise, you want to ensure the producer retries if sending the event fails.

The default for the producer's retries configuration is 2147483647. If a produce request fails, the producer will keep retrying.

However, produce requests will fail before the retries are exhausted if the timeout configured by delivery.timeout.ms expires before a successful acknowledgement. The default for delivery.timeout.ms is 2 minutes, so you might want to increase it.

To ensure the producer always sends the record, you also want to look at the producer's acks configuration.

If acks=all, all replicas in the ISR must acknowledge the record before it is considered successful. This guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee.
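
Putting those settings together, a sketch of a producer configured for durability (the broker address and serializers are assumptions, and the values are illustrative):

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class DurableProducer {
        public static KafkaProducer<String, String> build() {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);   // the default: 2147483647
            props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 300_000); // raised from the 2-minute default
            props.put(ProducerConfig.ACKS_CONFIG, "all");                  // all in-sync replicas must ack
            return new KafkaProducer<>(props);
        }
    }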

The above can cause duplicate messages. If you want to avoid duplicates, I can also let you know how to do that.

Upvotes: 2
