Reputation: 121
While working with Spark Structured Streaming and Kinesis Data Streams, I have experienced imbalanced reads when reprocessing data that has accumulated in the stream (as opposed to reading from the latest position).
The following graph shows the difference in read speed across the Kinesis shards that make up the stream.
This causes the Spark jobs to drop a lot of events: records with very different event times get mixed together, and those considered too old are discarded.
Recently a team member suggested using Kafka instead. I was a bit skeptical that Apache Kafka would solve this issue because, AFAIK, the only way to fix the imbalanced reads described above is to introduce some kind of coordination at the consumer level. This is how the Kinesis connector for Apache Flink provides alignment while reprocessing Kinesis streams (event-time-alignment-for-shard-consumers).
I have been investigating the architecture and design of Apache Kafka in more depth, but I can't see anything that resembles a coordination mechanism for consumer groups.
Still, after some tests, reprocessing messages from a Kafka topic is much more consistent across partitions. It seems as if there were some coordination mechanism.
I'm aware that introducing coordination among nodes in a distributed system like Kafka would decrease throughput (which is the price you pay for shard alignment in the Flink connector for Kinesis). That makes me even more curious: how can this possibly happen? How can Apache Kafka achieve this without a coordination mechanism?
Upvotes: 1
Views: 427
Reputation: 1821
The Kafka partitioner decides how messages are distributed across the partitions of a Kafka topic.
By default, the Kafka producer applies the murmur2 hash algorithm to the message key to decide which partition each record goes to. Thanks to this hash, Kafka guarantees that records with the same key always land in the same partition.
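As a sketch, this is the keyed-partitioning logic in pure Python: a port of the 32-bit murmur2 function the Java client uses, plus a hypothetical `partition_for` helper that mirrors the default partitioner's `toPositive(murmur2(key)) % numPartitions` step.

```python
def murmur2(data: bytes) -> int:
    """32-bit murmur2 hash (same constants as the Kafka Java client)."""
    length = len(data)
    seed = 0x9747B28C
    m = 0x5BD1E995
    r = 24
    h = (seed ^ length) & 0xFFFFFFFF
    i = 0
    # Process the input four bytes at a time.
    while length - i >= 4:
        k = data[i] | (data[i + 1] << 8) | (data[i + 2] << 16) | (data[i + 3] << 24)
        k = (k * m) & 0xFFFFFFFF
        k ^= k >> r
        k = (k * m) & 0xFFFFFFFF
        h = (h * m) & 0xFFFFFFFF
        h ^= k
        i += 4
    # Handle the remaining 0-3 bytes (fall-through, as in the C/Java original).
    left = length - i
    if left >= 3:
        h ^= data[i + 2] << 16
    if left >= 2:
        h ^= data[i + 1] << 8
    if left >= 1:
        h ^= data[i]
        h = (h * m) & 0xFFFFFFFF
    # Final avalanche mixing.
    h ^= h >> 13
    h = (h * m) & 0xFFFFFFFF
    h ^= h >> 15
    return h

def partition_for(key: bytes, num_partitions: int) -> int:
    """Hypothetical helper: mask off the sign bit, then take the modulo,
    as Kafka's default partitioner does for keyed records."""
    return (murmur2(key) & 0x7FFFFFFF) % num_partitions
```

Because the mapping is a pure function of the key bytes and the partition count, every producer picks the same partition for the same key with no coordination at all.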
If your use case does not require ordering between events, you may not need to send a key at all. When no key is sent, messages are distributed across the partitions in a round-robin fashion.
When a consumer joins a consumer group, it is assigned one or more partitions that it alone is responsible for processing. No other consumer from the same consumer group shares ownership of those partitions.
So, to your question: if your producer distributes messages evenly across the topic's partitions, and the partition count divides evenly among the consumer threads in the group, each consumer owns the same number of partitions and consumption will be roughly the same across consumers.
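The even split above can be sketched with a simplified, single-topic version of the logic in Kafka's RangeAssignor (`range_assign` is a hypothetical helper, not the actual API):

```python
def range_assign(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Simplified single-topic range assignment: sort the consumers, give each
    a contiguous block of partitions; if the division isn't exact, the first
    (num_partitions % num_consumers) consumers each get one extra partition."""
    consumers = sorted(consumers)
    per, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        n = per + (1 if i < extra else 0)
        assignment[consumer] = partitions[start:start + n]
        start += n
    return assignment
```

Each partition ends up owned by exactly one group member, so the "coordination" you observed is just this one-time assignment done by the group coordinator at rebalance, not any per-record alignment between consumers.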
Upvotes: 1