Nicholas Kou

Reputation: 303

Kafka Consumer reading rate seems to decrease when adding consumers on the same group id

I have created a Kafka topic and produced 5 GB of CSV records to it. I set the number of partitions equal to the number of consumers I'm going to use. My Apache Kafka setup consists of 3 brokers, and I don't use replication for my data.

When the topic is consumed from one node (one consumer, one partition), the consumer reads the data at a rate of 65K records/sec.

When the topic is consumed from two nodes (two consumers, two partitions), the consumers read the data at an overall rate of 120K records/sec (60K for each consumer).

Adding more consumers and partitions (for example, 10 more), the throughput per consumer decreases and the overall throughput seems to stabilize at a fixed value (reaching 420K records/sec).

Is this expected behaviour for Apache Kafka? I was expecting that by adding more and more consumers, the overall throughput would increase linearly.
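To make the sub-linear scaling concrete, here is a small sketch (plain Python, using only the numbers reported above) comparing the measured aggregate throughput against ideal linear scaling:

```python
# Observed aggregate throughput (records/sec) vs. number of consumers,
# taken from the measurements above (2 consumers + "10 more" = 12).
observed = {1: 65_000, 2: 120_000, 12: 420_000}

# Per-consumer throughput for each configuration.
per_consumer = {n: total / n for n, total in observed.items()}

# If scaling were perfectly linear, n consumers would deliver n * 65K.
ideal = {n: n * 65_000 for n in observed}

for n in observed:
    print(f"{n} consumer(s): {per_consumer[n]:.0f} rec/s each, "
          f"{observed[n] / ideal[n]:.0%} of linear scaling")
```

With 12 consumers the cluster delivers only a bit over half of what linear scaling would predict, which is the behaviour the question is asking about.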

Upvotes: 0

Views: 1274

Answers (2)

Giorgos Myrianthous

Reputation: 39820

More partitions in a Kafka cluster lead to higher throughput. However, you need to be aware that the number of partitions also has an impact on availability and latency.

In general, more partitions:

  • Lead to higher throughput
  • Require more open file handles
  • May increase unavailability
  • May increase end-to-end latency
  • May require more memory on the client side

You need to study the trade-offs and make sure that you've picked the number of partitions that satisfies your requirements regarding throughput, latency and required resources.

For further details refer to this blog post from Confluent.
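The Confluent post's rule of thumb for sizing can be sketched as follows; the function name and the example rates are illustrative assumptions, not values from the question:

```python
from math import ceil

def suggest_partitions(target_throughput, producer_rate, consumer_rate):
    """Rule-of-thumb partition count: max(t/p, t/c), where t is the
    target aggregate throughput and p/c are the measured single-partition
    producer and consumer rates (all in records/sec)."""
    return max(ceil(target_throughput / producer_rate),
               ceil(target_throughput / consumer_rate))

# Hypothetical example: to reach 600K rec/s when one partition sustains
# ~100K rec/s on the producer side and ~65K rec/s on the consumer side,
# the consumer side is the constraint and you need about 10 partitions.
print(suggest_partitions(600_000, 100_000, 65_000))
```

The point is that partitions should be sized from measured per-partition rates and a target throughput, not simply set equal to the consumer count.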

Upvotes: 0

Shailendra

Reputation: 9102

If there are more consumers in a consumer group than partitions, the extra consumers remain idle. A picture from the Kafka: The Definitive Guide book illustrates this.

As far as consumer throughput is concerned, apart from the number of partitions and consumers, it also depends on how the consumer processes each message. A bottleneck in the message-processing logic can limit throughput. This is also corroborated in a write-up by Confluent:

The consumer throughput is often application dependent since it corresponds to how fast the consumer logic can process each message. So, you really need to measure it.
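A minimal sketch of this point (plain Python, no Kafka involved): if handling one record takes t seconds, a single consumer can never exceed 1/t records/sec, no matter how many partitions it is assigned.

```python
import time

def process(record):
    # Stand-in for real consumer logic (parsing, validation, DB writes...).
    # Simulate roughly 50 microseconds of work per record.
    time.sleep(0.00005)

records = range(2_000)
start = time.perf_counter()
for r in records:
    process(r)
elapsed = time.perf_counter() - start

rate = len(records) / elapsed
# With ~50 us/record the ceiling is ~20K records/sec per consumer,
# regardless of how many partitions feed it.
print(f"{rate:.0f} records/sec")
```

This is why measuring the per-message processing cost of your own consumer logic matters before adding more partitions.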

Upvotes: 1
