Reputation: 6644
I have a streaming application (written on Spark, Storm, or whatever; it does not matter). Kafka is used as the source of stream events. Some events take significantly more resources (time, CPU, etc.) than other events.
There are application-specific nuances in how these larger messages are dealt with in the various frameworks. For example:
In Kafka, acking is only possible by committing an offset per partition, not at the individual-message level. So whenever these larger events arrive, the application eventually comes to a halt. This is a tradeoff against duplicate message processing: if the application dies while processing a large message, how much work can you afford to redo, given that all the messages after the large message would need to be replayed? Another issue is lag alerting: because the large message is stuck, the committed offset does not move even if I process the messages that come after it.
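The head-of-line blocking described above can be illustrated with a small, self-contained sketch (an assumption-laden simplification, not Kafka client code): since Kafka commits a single offset per partition, the committable offset can only advance through a contiguous run of completed offsets, so one stuck message pins it in place.

```python
# Simplified model (not Kafka client code) of per-partition offset commits:
# committing offset N asserts everything up to N is processed, so the
# committable offset stops at the first unprocessed ("large") message.

def committable_offset(completed: set[int], last_committed: int) -> int:
    """Highest offset safe to commit: end of the contiguous run after last_committed."""
    offset = last_committed
    while offset + 1 in completed:
        offset += 1
    return offset

# Offsets 0-5 consumed; offset 2 is the large, slow message, still in flight.
completed = {0, 1, 3, 4, 5}
print(committable_offset(completed, -1))  # 1: offsets 3-5 are done but cannot be committed

completed.add(2)  # the large message finally finishes
print(committable_offset(completed, -1))  # 5: the gap closed, the offset jumps forward
```

This is also why lag metrics keep growing while the large message is in flight: lag is measured against the committed offset, which stays at 1 in the example above no matter how many later messages finish.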
Based on this understanding, I am reaching the conclusion that Kafka is better suited when all the messages in a topic have similar processing times (at least Spark and Storm only offer tuning at the topic level, not at the individual-partition level).
Hence below are the options I have
Are there other options to deal with these scenarios?
Upvotes: 0
Views: 89
Reputation: 381
Do you need to maintain per-key ordering? If so, you could use a specialized consumer like Confluent's Parallel Consumer: https://github.com/confluentinc/parallel-consumer. It processes different keys in parallel while ensuring that records with the same key are processed sequentially (it also supports unordered processing). This lets small and large records from the same partition be processed in parallel, removing the head-of-line blocking issue. As you suggest, an idempotence mechanism would still be useful in case of failure.
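The key-ordered parallelism idea can be sketched as follows. This is an illustrative toy in Python, not the Parallel Consumer API: records are sharded by key onto single-threaded executors, so records with the same key run sequentially while different keys run concurrently.

```python
# Toy sketch of key-ordered parallel processing (not the Parallel Consumer API):
# each shard is a one-worker executor, so same-key records keep their order
# while different keys can be processed in parallel.
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4
shards = [ThreadPoolExecutor(max_workers=1) for _ in range(NUM_SHARDS)]
results: dict[str, list[int]] = {}

def process(key: str, value: int) -> None:
    # Safe without a lock: all records for a given key land on the same
    # single-threaded shard, so a key's list is only appended to serially.
    results.setdefault(key, []).append(value)

records = [("a", 1), ("b", 10), ("a", 2), ("b", 20), ("a", 3)]
futures = [shards[hash(k) % NUM_SHARDS].submit(process, k, v) for k, v in records]
for f in futures:
    f.result()
for ex in shards:
    ex.shutdown()

print(results["a"])  # [1, 2, 3] -- per-key order preserved
print(results["b"])  # [10, 20]
```

A large record for key "a" would delay only shard(hash("a")); records for other keys on other shards keep flowing, which is exactly how the head-of-line blocking within a partition is avoided.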
Note that Queues for Kafka is coming in KIP-932.
Upvotes: 1