naviram

Reputation: 1475

Kafka Consumer - Random access fetching of partition range

The question: How can I randomly fetch an old chunk of messages given a range definition of [partition, start offset, end offset]? Ideally this would support ranges from multiple partitions at once (one range per partition), and it needs to work in a concurrent environment as well.

My ideas for a solution so far: I guess I can use a pool of consumers for the concurrency and, for each fetch, call Consumer.seek followed by Consumer.poll with max.poll.records. But this seems wrong: there is no guarantee that I will get back the exact same chunk, for example when a message has been deleted (by log compaction). As a whole, this seek + poll method does not seem like the right fit for a one-time random fetch.
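For reference, here is a minimal sketch of what that seek + poll approach might look like with the Java client (`fetchRange` is a hypothetical helper, not from the original post; as noted above, compaction can leave holes in the range, so this is best-effort):

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RangeFetcher {

    // Fetch all records in [startOffset, endOffset) from one partition.
    // Caveat: with log compaction or retention, offsets inside the range
    // may no longer exist, so the returned chunk may have gaps.
    static List<ConsumerRecord<byte[], byte[]>> fetchRange(
            KafkaConsumer<byte[], byte[]> consumer,
            TopicPartition tp, long startOffset, long endOffset) {

        consumer.assign(Collections.singletonList(tp)); // manual assignment, no consumer group rebalancing
        consumer.seek(tp, startOffset);

        List<ConsumerRecord<byte[], byte[]>> chunk = new ArrayList<>();
        while (consumer.position(tp) < endOffset) {
            ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
            if (records.isEmpty()) {
                break; // nothing left to read; the range may be truncated
            }
            for (ConsumerRecord<byte[], byte[]> r : records.records(tp)) {
                if (r.offset() >= endOffset) {
                    return chunk; // polled past the end of the range
                }
                chunk.add(r);
            }
        }
        return chunk;
    }
}
```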

My use case: Like a typical consumer, mine reads 10MB chunks of messages and processes them. To process a chunk, I push 3-20 jobs to different topics as part of a workflow. My goal is to avoid pushing the same chunk into those other topics again and again; it seems better to push a reference to the chunk, e.g. [Topic X / partition Y, start offset, end offset]. Then, when each job is processed, it fetches the exact same chunk again.
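For illustration, such a reference could be as small as this (a hypothetical `ChunkRef` type, assuming Java 16+ records; not part of the original post):

```java
// Hypothetical reference describing one chunk; jobs receive this small record
// instead of the 10MB payload and re-fetch the chunk themselves.
public record ChunkRef(String topic, int partition, long startOffset, long endOffset) {}
```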

Upvotes: 0

Views: 825

Answers (1)

OneCricketeer

Reputation: 191681

Your idea seems fine, and it is practically the only solution with the Consumer API. There is nothing you can do once messages have been removed between those offsets.

If you really need every single message between each and every possible offset range, then you should consider consuming that data as it is actively produced into some externally indexable destination where offset scans are also a common operation. Plenty of Kafka Connect sink connectors exist for databases and filesystems. But the takeaway here is that I think you might have to reconsider your options for these "reprocessing" jobs.
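A minimal sketch of that pattern, assuming a PostgreSQL sink with a hypothetical `messages` table (a Kafka Connect JDBC sink connector achieves the same without custom code; the topic name, connection URL, and schema here are all illustrative):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OffsetIndexedArchiver {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "archiver");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        // Hypothetical table:
        //   messages(topic TEXT, part INT, msg_offset BIGINT, value BYTEA,
        //            PRIMARY KEY (topic, part, msg_offset))
        try (Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/archive");
             KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {

            consumer.subscribe(Collections.singletonList("topic-x"));
            PreparedStatement insert = db.prepareStatement(
                    "INSERT INTO messages (topic, part, msg_offset, value) "
                            + "VALUES (?, ?, ?, ?) ON CONFLICT DO NOTHING");

            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<byte[], byte[]> r : records) {
                    insert.setString(1, r.topic());
                    insert.setInt(2, r.partition());
                    insert.setLong(3, r.offset());
                    insert.setBytes(4, r.value());
                    insert.executeUpdate();
                }
                // Jobs can later fetch an exact chunk with:
                //   SELECT value FROM messages
                //   WHERE topic = ? AND part = ? AND msg_offset BETWEEN ? AND ?
            }
        }
    }
}
```

Unlike the broker's log, the external store keeps every offset until you delete it, so a [partition, start, end] reference stays resolvable regardless of compaction or retention.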

Upvotes: 1
