Gerard Murphy

Reputation: 221

Kafka as a message queue for long running tasks

I am wondering whether I am missing something in my setup for handling long-running jobs.

For my purposes at-most-once message delivery is fine. This means I do not need to think carefully about committing offsets (or at least it is OK to commit each message's offset as soon as it is received).

I have the following in order to achieve the competing consumer pattern:

My problem is that I have messages that can take ~15 minutes to process (and this may fluctuate by up to 50%, let's say). To avoid consumers having their partition assignments revoked, I have increased the value of max.poll.interval.ms to reflect this. However, this comes with some negative consequences:

As it stands at the moment I see that I can proceed as follows:

However I do not like this, and am considering looking at alternative technology for my message queue as I do not see any obvious way around this. Admittedly I am new to Kafka, and it is just a gut feeling that the above is not desirable. I have used RabbitMQ in the past for these scenarios, however we need Kafka in our architecture for other purposes at the moment and it would be nice not to have to introduce another technology if Kafka can achieve this.

I appreciate any advice that anybody can offer on this subject.

Upvotes: 16

Views: 15775

Answers (2)

Gerard Murphy

Reputation: 221

The way we got around the issues we were having was as suggested in the comments. We decided to decouple the message processing from the consumer polling.

On each worker/consumer there were two threads: one doing the actual processing, and the other phoning home to Kafka periodically (that is, continuing to poll so the consumer is not considered dead).
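A minimal sketch of that two-thread arrangement, not our exact code. It assumes a consumer object with a kafka-python-style interface (`poll(timeout_ms=..., max_records=...)`, `assignment()`, `pause(*partitions)`, `resume(*partitions)`, `commit()`); the names `consume_long_running`, `handle` and `stop` are made up for illustration.

```python
import collections
import threading


def consume_long_running(consumer, handle, stop, poll_timeout_ms=1000):
    """Run one job at a time on a worker thread while this thread keeps polling.

    consumer: assumed to offer kafka-python-style poll/assignment/pause/resume/commit.
    handle:   hypothetical callback that processes a single record (may take minutes).
    stop:     threading.Event used to end the loop.
    """
    pending = collections.deque()  # records fetched but not yet started
    worker = None
    while not stop.is_set():
        busy = worker is not None and worker.is_alive()
        if busy or pending:
            # Paused partitions return no records, but poll() still counts
            # as a sign of life, so max.poll.interval.ms can stay small.
            consumer.pause(*consumer.assignment())
        else:
            consumer.resume(*consumer.assignment())
        batches = consumer.poll(timeout_ms=poll_timeout_ms, max_records=1)
        for records in batches.values():
            pending.extend(records)
        if pending and not (worker is not None and worker.is_alive()):
            # At-most-once, as the question allows: commit before processing.
            consumer.commit()
            worker = threading.Thread(target=handle, args=(pending.popleft(),))
            worker.start()
    if worker is not None:
        worker.join()  # let the in-flight job finish before returning
```

Pausing on every loop iteration is deliberate: it also covers partitions that are newly assigned after a rebalance. Anything still sitting in `pending` when `stop` is set is dropped, which is acceptable only because at-most-once delivery is acceptable here.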

We also did some work to reduce the processing times for messages. However, some messages still take minutes to process. This has worked for us for some time now with no issues.

Thanks for the suggestion in the comments, @Donal.

Upvotes: 3

senseiwu

Reputation: 5279

Using Kafka as a job queue for scheduling long-running processes is not a good idea: Kafka is not a queue in the strictest sense, and its semantics for failure handling and retries are limited. Though you might be able to reach a compromise by tuning the rebalance or timeout configuration, the design is likely to remain brittle. The simple answer is that Kafka was not designed for this kind of use case.

The idea of max.poll.interval.ms is to prevent a livelock situation (see), but in your case it produces a false positive: Kafka has no way to distinguish a livelocked consumer from one that is legitimately busy with a long job, so a rebalance is triggered either way.

You should weigh the tradeoff between living with the negative consequences you mentioned and introducing a new technology that lets you model a job queue in a better way. For a more complex use case, check out how Slack is doing it.
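To put numbers on that, here is a small sketch using the figures from the question (~15 minutes, fluctuating by up to 50%). The helper name is made up, and the config keys assume kafka-python's `KafkaConsumer` keyword arguments; treat it as an illustration rather than a recommended setting.

```python
def worst_case_poll_interval_ms(typical_minutes, fluctuation):
    """Longest expected gap between two poll() calls, in milliseconds."""
    return int(typical_minutes * (1 + fluctuation) * 60 * 1000)


# Assumed kafka-python-style consumer settings (hypothetical values).
consumer_config = {
    # 15 min * 1.5 = 22.5 min = 1,350,000 ms
    "max_poll_interval_ms": worst_case_poll_interval_ms(15, 0.5),
    # One long job per poll, so one record at a time.
    "max_poll_records": 1,
}
```

The flip side is that a consumer that really is stuck would go unnoticed for the same 22.5 minutes before its partitions are reassigned.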

Upvotes: 8
