Kafka as a message queue for long running tasks

Question

I am wondering if there is something I am missing about my set up to facilitate long running jobs.

For my purposes it is ok to have At most once message delivery, this means it is not required to think about committing offsets (or at least it is ok to commit each message offset upon receiving it).

I have the following in order to achieve the competing consumer pattern:

A topic
X consumers in the same group
P partitions in a topic (where P >= X always)

My problem is that I have messages that can take ~15 minutes (but this may fluctuate by up to 50% lets say) in order to process. In order to avoid consumers having their partition assignments revoked I have increased the value of max.poll.interval.ms to reflect this. However this comes with some negative consequences:

if some message exceeds this length of time then in a worst case scenario a the consumer processing this message will have to wait up to the value of max.poll.interval.ms for a rebalance
if I need to scale and increase the number of consumers based on load then any new consumers might also have to wait the value of max.poll.interval.ms for a rebalance to occur in order to process any new messages

As it stands at the moment I see that I can proceed as follows:

Set max.poll.interval.ms to be a small value and accept that every consumer processing every message will time out and go through the process of having assignments revoked and waiting a small amount of time for a rebalance

However I do not like this, and am considering looking at alternative technology for my message queue as I do not see any obvious way around this. Admittedly I am new to Kafka, and it is just a gut feeling that the above is not desirable. I have used RabbitMQ in the past for these scenarios, however we need Kafka in our architecture for other purposes at the moment and it would be nice not to have to introduce another technology if Kafka can achieve this.

I appreciate any advise that anybody can offer on this subject.

Gerard Murphy · Accepted Answer

The way we got around the issues we were having was as suggested in the comments. We decided to decouple the message processing from the consumer polling.

On each worker/consumer there were 2 threads, one for doing the actual processing and the other for phoning home to Kafka periodically.

We also did some work with trying to reduce the processing times for messages. However some messages still take time that can be measured in minutes. This has worked for us now for some time with no issues.

Thanks for this suggestions in comments @Donal

Kafka as a message queue for long running tasks

Answers (2)

Related Questions