How to ensure for Kafka Streams when listening to topics with multiple partitions that all related data is processed?

Question

I would like to know how Kafka Streams are assigned to partitions of topics for reading. As far as I understand it, each Kafka Stream Thread is a Consumer (and there is one Consumer Group for the Stream). So I guess the Consumers are randomly assigned to the partitions.

But how does it work, if I have multiple input topics which I want to join?

Example:

Topic P contains persons. It has two partitions. The key of the message is the person-id so each message which belongs to a person always ends up in the same partition.

Topic O contains orders. It has two partitions. Lets say the key is also the person-id (of the person who ordered something). So here, too, each order-message which belongs to a person always ends up in the same partition.

Now I have stream which which reads from both topics and counts all orders per person and writes it to another topic (where the message also includes the name of the person).

Data in topic P:

Partition 1: "hans, id=1", "maria, id=3"

Partition 2: "john, id=2"

Data in topic O:

Partition 1: "person-id=2, pizza", "person-id=3, cola"

Partition 2: "person-id=1, lasagne"

And now I start two streams.

Then this could happen:

Stream 1 is assigned to topic P partition 1 and topic O partition 1.

Stream 2 is assigned to topic P partition 2 and topic O partition 2.

This means that the order lasagne for hans would never get counted, because for that a stream would need to consume topic P partition 1 and topic O partition 2.

So how to handle that problem? I guess its fairly common that streams need to somehow process data which relates to each other. So it must be ensured that the relating data (here: hans and lasagne) is processed by the same stream.

I know this problem does not occur if there is only one stream or if the topics only have one partition. But I want to be able to concurrently process messages.

Thanks

Tuyen Luong · Accepted Answer

Your use case is a KStream-KTable join where KTable store info of Users and KStream is the stream of Order, so the 2 topics have to be co-partitioned which they must have same partitions number and partitioned by the same key and Partitioner. If you're using person-id as key for kafka messages, and using the same Partitioner you should not worry about this case, cause they are on the same partition number.

Updated : As Matthias pointed out each Stream Thread has it's own Consumer instance.

How to ensure for Kafka Streams when listening to topics with multiple partitions that all related data is processed?

Answers (1)

Related Questions