Reputation: 469
We have a cluster with 3 ZooKeeper nodes and 7 brokers. We now need to create a topic and decide how many partitions it should have.
But I could not find any formula for deciding how many partitions to create for this topic. The producer rate is 5k messages/sec and each message is 130 bytes.
Thanks In Advance
Upvotes: 16
Views: 33920
Reputation: 510
For example, if you want to be able to read 1000 MB/sec, but your consumer is only able to process 50 MB/sec, then you need at least 20 partitions and 20 consumers in the consumer group. Similarly, if you want to achieve the same for producers, and 1 producer can only write at 100 MB/sec, you need 10 partitions. In this case, with 20 partitions you can maintain 1 GB/sec for both producing and consuming messages. You should adjust the exact number of partitions to the number of consumers or producers, so that each consumer and producer achieves its target throughput.
So a simple formula could be:
#Partitions = max(NP, NC) where:
NP is the number of required producers determined by calculating: TT/TP
NC is the number of required consumers determined by calculating: TT/TC
TT is the total expected throughput for our system
TP is the max throughput of a single producer to a single partition
TC is the max throughput of a single consumer from a single partition
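A quick sketch of that calculation in Python (the function name and units are illustrative, not part of any Kafka API):

```python
import math

def partitions_needed(tt_mb_s, tp_mb_s, tc_mb_s):
    """Estimate the partition count from the max(NP, NC) rule above."""
    np_count = math.ceil(tt_mb_s / tp_mb_s)  # NP = TT / TP (producers needed)
    nc_count = math.ceil(tt_mb_s / tc_mb_s)  # NC = TT / TC (consumers needed)
    return max(np_count, nc_count)

# Numbers from the example above: TT = 1000 MB/s, TP = 100 MB/s, TC = 50 MB/s
print(partitions_needed(1000, 100, 50))  # -> 20
```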
Upvotes: 4
Reputation: 26
You could choose the number of partitions equal to the maximum of {total throughput / per-producer throughput ; total throughput / per-consumer throughput}. The total throughput is calculated from the message volume per second. Here you have: Throughput = 5k msg/sec * 130 bytes = 650 KB/sec (about 0.65 MB/sec).
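A one-off check of that arithmetic, using the numbers from the question:

```python
msgs_per_sec = 5_000
msg_size_bytes = 130

throughput_bytes_s = msgs_per_sec * msg_size_bytes
print(throughput_bytes_s)        # 650000 bytes/s
print(throughput_bytes_s / 1e6)  # 0.65 MB/s -- far below typical broker limits
```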
Upvotes: 0
Reputation: 369
This old benchmark by a Kafka co-founder is quite useful for understanding the magnitudes of scale involved - https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
The immediate conclusion from it, as Vanlightly said here, is that the consumer handling time is the most important factor in deciding on the number of partitions (since at this rate you are nowhere close to challenging the producer throughput).
The maximum concurrency for consuming is the number of partitions, so you want to make sure that:
(processing time per message in seconds x number of messages per second) / number of partitions << 1
If this ratio equals 1, you can only just keep up with the write rate, and that is before considering bursts of messages and failures/downtime of consumers. So you will need it to be significantly lower than 1; how much lower depends on the latency your system can tolerate.
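A small sketch of that check, using the question's 5k msg/sec rate and an assumed 100 ms per-message handling time (the handling time is an example value, not from the question):

```python
def utilization(proc_time_s, msgs_per_sec, partitions):
    """Per-partition utilization from the rule above; keep this well below 1."""
    return proc_time_s * msgs_per_sec / partitions

# Assumed 100 ms per message at the question's 5k msg/sec rate.
for partitions in (500, 1_000, 2_000):
    print(partitions, round(utilization(0.100, 5_000, partitions), 2))
# 500 -> 1.0 (can only just keep up), 1000 -> 0.5, 2000 -> 0.25
```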
Upvotes: 7
Reputation: 61
#Partitions = max(NP, NC), where:
NP is the number of required producers, determined by calculating TT/TP
NC is the number of required consumers, determined by calculating TT/TC
TT is the total expected throughput for our system
TP is the max throughput of a single producer to a single partition
TC is the max throughput of a single consumer from a single partition
Upvotes: 3
Reputation: 2439
It depends on your required throughput, cluster size, and hardware specifications:
There is a clear blog post about this written by Jun Rao from Confluent: How to choose the number of topics/partitions in a Kafka cluster?
This might also be helpful for insight: Apache Kafka Supports 200K Partitions Per Cluster
Upvotes: 4
Reputation: 1479
I can't give you a definitive answer; there are many patterns and constraints that can affect it. But here are some of the things you might want to take into account:
The unit of parallelism is the partition, so if you know the average processing time per message, you can calculate the number of partitions required to keep up. For example, if each message takes 100 ms to process and you receive 5k per second, then you'll need at least 500 partitions (5,000 msg/sec x 0.1 sec/msg). Add a percentage more than that to cope with peaks and variable infrastructure performance (see the first sketch after this list). Queuing theory can give you the math to calculate your parallelism needs.
How bursty is your traffic, and what latency constraints do you have? Building on the last point, if you also have latency requirements, then you may need to scale out your partitions to cope with your peak rate of traffic rather than the average.
If you use any data-locality patterns or require ordering of messages, then you need to consider future traffic growth. For example, say you deal with customer data, use the customer id as the partition key, and depend on each customer always being routed to the same partition - perhaps for event sourcing, or simply to ensure each change is applied in the right order. If you later add new partitions to cope with a higher rate of messages, then each customer will likely be routed to a different partition. This can introduce a few headaches regarding guaranteed message ordering, as a customer now exists on two partitions (the second sketch after this list illustrates the remapping). So you want to create enough partitions for future growth. Just remember that it is easy to scale consumers out and in, but partitions need some planning, so stay on the safe side and be future-proof.
Having thousands of partitions can increase overall latency.
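A minimal sketch of the sizing rule from the first point, with a hypothetical headroom factor for peaks (the function name and the 25% figure are illustrative assumptions):

```python
import math

def partitions_with_headroom(msgs_per_sec, proc_time_s, headroom=0.25):
    """Partitions = message rate x per-message processing time, plus headroom."""
    return math.ceil(msgs_per_sec * proc_time_s * (1 + headroom))

# 5k msg/sec at an assumed 100 ms per message, with 25% headroom for peaks.
print(partitions_with_headroom(5_000, 0.100))  # -> 625
```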
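And a minimal sketch of the repartitioning pitfall: Kafka's default partitioner hashes the message key and takes it modulo the partition count (it uses murmur2; zlib.crc32 below is only an illustrative stand-in), so adding partitions generally remaps keys:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for Kafka's default partitioner: hash the key bytes,
    # then take the result modulo the partition count.
    return zlib.crc32(key.encode()) % num_partitions

for customer_id in ("customer-17", "customer-42", "customer-99"):
    print(customer_id,
          partition_for(customer_id, 10),   # before adding partitions
          partition_for(customer_id, 12))   # after: mapping usually changes
```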
Upvotes: 9