Reputation: 31
I have a Kafka topic with 20 partitions and 5 consumers belonging to the same consumer group, which means 4 partitions per consumer. Let's say:
The producer evenly sends 10 messages to the topic. In this case, only partitions 0 through 9 receive messages; the remaining ones stay empty. Our problem is that consumer-0 and consumer-1 will each process 4 messages while, at the same time, consumer-2 will process only 2 messages. Also, consumer-3 and consumer-4 will not do any processing at all since their partitions are idle.
On the producer side, we are working with the DefaultPartitioner (kafka-clients 2.3.1) so that the records are evenly sent to the partitions. We would like to ask whether it is possible to produce messages fairly based on the Kafka consumers rather than the partitions. That way, each consumer would process only two messages and the processing load would be fairly distributed between consumers.
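For reference, a minimal sketch of our producer side (the topic name, broker address, and serializer choices below are placeholders, not our real configuration):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DemoProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // No partitioner.class is set, so the DefaultPartitioner is used:
        // keyed records go to hash(key) modulo the number of partitions.

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                producer.send(new ProducerRecord<>("my-topic", "key-" + i, "message-" + i));
            }
        }
    }
}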
Upvotes: 3
Views: 1804
Reputation: 18485
In general, I do not think it is a good design to force a producer to partition the data based on the consumers. A Kafka topic should separate the dependencies between a producer and a consumer and encapsulate them from each other.
Two main reasons to not try to achieve this:
I understand this might not actually answer your question. If you want proper balancing, you should match the number of partitions with the number of consumer threads and ensure on the producer side that all messages are produced in a balanced way across the partitions.
Remember that even when using the DefaultPartitioner with as many as 20 partitions, you can still end up producing the data unbalanced, as it depends on the hash values of your keys.
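If you do decide to do the balancing on the producer side, one option is a custom Partitioner that simply cycles through the partitions instead of hashing the key. A minimal sketch (the class name and counter logic are illustrative only, not something I have tested against 2.3.1):

import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;

public class EvenSpreadPartitioner implements Partitioner {

    private final AtomicInteger counter = new AtomicInteger(0);

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        // Mask the sign bit so the index stays non-negative even after overflow.
        int next = counter.getAndIncrement() & Integer.MAX_VALUE;
        return next % partitions.size();
    }

    @Override
    public void close() {
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }
}

You would register it with props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, EvenSpreadPartitioner.class.getName()). As far as I know, newer kafka-clients versions (2.4 and above) also ship a RoundRobinPartitioner that does essentially the same thing out of the box.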
Upvotes: 0
Reputation: 4024
I think the calculations you made are not relevant, because there is no realistic scenario in which only 10 messages will be sent, and if that really is the situation you should consider using fewer partitions and correspondingly fewer consumers in the consumer group.
You can assume that for a larger number of records in the stream, your producer will distribute the load roughly evenly between partitions and therefore between consumers, and then you won't care whether consumer-1 received 1000 records and consumer-2 received 998.
Remember also that loads change, and in the low-traffic phases where you don't want consumers to sit idle but to handle the same loads, it is completely OK that some consumers get 4 messages, others 2, and others 0, because processing 4 messages is basically still being "idle" relative to the loads you are expecting, and these differences are so minor they don't really count; so let Kafka do the magic for the higher loads, when processing power/time really matters.
Upvotes: 1