laxman

Reputation: 2120

creating a topic with more partitions than brokers

I have 3 brokers set up on my local machine, and I create a topic with 5 partitions:

bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 3 --partitions 5 --topic test-partitions

and then describe my test-partitions topic:

bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic test-partitions

which results in

Topic: test-partitions  PartitionCount: 5   ReplicationFactor: 3    Configs: segment.bytes=1073741824
    Topic: test-partitions  Partition: 0    Leader: 1   Replicas: 1,2,0 Isr: 1,2,0
    Topic: test-partitions  Partition: 1    Leader: 0   Replicas: 0,1,2 Isr: 0,1,2
    Topic: test-partitions  Partition: 2    Leader: 2   Replicas: 2,0,1 Isr: 2,0,1
    Topic: test-partitions  Partition: 3    Leader: 1   Replicas: 1,0,2 Isr: 1,0,2
    Topic: test-partitions  Partition: 4    Leader: 0   Replicas: 0,2,1 Isr: 0,2,1

Kafka doesn't raise any error here.

Now, when I use the producer/consumer APIs, everything works fine. What I am not able to understand is that nowhere in the producer or consumer configuration do I define which partition to connect to. My question is: how does Kafka decide which partition within the same broker to push a message to when I have multiple partitions for the same topic? Isn't this inconsistent behaviour?

Upvotes: 1

Views: 1615

Answers (3)

Rohit Yadav

Reputation: 2578

In Kafka, spreading/distributing data over multiple machines is handled at the level of partitions (not individual records).

Data and Partitions: the relationship

When a producer sends data, it goes to a topic – but that's the 50,000-foot view. You must understand that:

- data is actually a key-value pair
- its storage happens at the partition level

Default behavior

The data for the same key goes to the same partition, since Kafka uses a hashing algorithm to map keys to partitions:

hash(key) % number_of_partitions

In the case of a null key (yes, that's possible), the data is placed on any of the partitions. If you only have one partition, ALL data goes to that single partition.
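
For illustration, a minimal producer sketch (the topic and bootstrap server are taken from the question; the key "user-42" is made up): the two keyed sends below are guaranteed to land on the same partition, while the unkeyed one can end up anywhere.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // same key "user-42" -> same partition on every send
            producer.send(new ProducerRecord<>("test-partitions", "user-42", "first event"));
            producer.send(new ProducerRecord<>("test-partitions", "user-42", "second event"));
            // null key -> the producer spreads records across partitions itself
            producer.send(new ProducerRecord<>("test-partitions", null, "unkeyed event"));
        }
    }
}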

You can plug in a custom algorithm to partition the data by implementing the Partitioner interface and configuring the Kafka producer to use it.

Here is an example:

import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class RandomKafkaPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object val, byte[] valBytes, Cluster cluster) {
        // pick a target partition at random, ignoring the key
        int numPartitions = cluster.partitionsForTopic(topic).size();
        return ThreadLocalRandom.current().nextInt(numPartitions);
    }

    @Override
    public void close() {
        // no-op
    }

    @Override
    public void configure(Map<String, ?> map) {
        // no-op
    }
}
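
To make the producer actually use it, register the class under the standard partitioner.class property; a minimal configuration sketch (assumes the class is on the producer's classpath):

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// route every record through the custom partitioner above
props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, RandomKafkaPartitioner.class.getName());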

https://www.confluent.io/blog/apache-kafka-producer-improvements-sticky-partitioner/

Upvotes: 1

Michael Heil

Reputation: 18525

Kafka doesn't raise any error here.

There is no need to raise an error. I guess you are mixing up partitioning with replication. You can theoretically have as many partitions as you like; it is only the replication factor that is bounded by the number of brokers.

How does Kafka decide which partition within the same broker to push a message to when I have multiple partitions for the same topic?

If a message has a key, Kafka will hash the key and use the result to map the message to a specific Partition, basically:

hash(key) % number_of_partitions

This means that all messages with the same key will go to the same Partition. If the key is null and the default Partitioner is used, the records are spread over the Partitions in a round-robin fashion (newer client versions use a sticky strategy that fills a batch for one Partition before moving to the next).
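
For reference, this is roughly what the default partitioner in kafka-clients computes for a non-null key; a small demo, assuming the 5 partitions from the question (the key "user-42" is invented):

import java.nio.charset.StandardCharsets;

import org.apache.kafka.common.utils.Utils;

public class DefaultPartitionDemo {
    public static void main(String[] args) {
        byte[] keyBytes = "user-42".getBytes(StandardCharsets.UTF_8);
        int numPartitions = 5; // PartitionCount of test-partitions above

        // toPositive() clears the sign bit so the modulo result is never negative
        int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        System.out.println("key 'user-42' -> partition " + partition);
    }
}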

However, you may override this behavior and provide your own partitioning scheme. For example, if you have many messages with a particular key, you may want one Partition to hold just those messages. This would ensure that all messages with the same key stay in order when consuming them, as in the sketch below.
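
A sketch of that idea, assuming keyed records and at least 2 partitions (the key "bigCustomer" is invented for illustration): partition 0 is reserved for one heavy key, and every other key is hashed over the remaining partitions.

import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class HeavyKeyPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object val, byte[] valBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if ("bigCustomer".equals(key)) {
            return 0; // the heavy key gets partition 0 all to itself
        }
        // hash every other key over partitions 1..n-1 (murmur2 ships with kafka-clients)
        return 1 + (Utils.toPositive(Utils.murmur2(keyBytes)) % (numPartitions - 1));
    }

    @Override
    public void close() {
        // no-op
    }

    @Override
    public void configure(Map<String, ?> map) {
        // no-op
    }
}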

Upvotes: 1

Manoj Vadehra

Reputation: 846

You are getting confused between two concepts:

Partitioning and Replication

Partitions are simply "Parts" of the same topic and these don't have a dependency on the number of available brokers.

When a producer produces data, it can distribute records using round-robin or hashing mechanisms – these are the default strategies; read about them in the Kafka documentation. The other option is to specify the exact partition you want to produce to (if you are logically separating the data), as in the snippet below.
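
For instance, ProducerRecord has a constructor that takes the target partition explicitly; a short sketch, assuming a producer already configured with String serializers (topic, partition number, key and value are placeholders):

// send straight to partition 2 of the topic, bypassing the partitioner entirely
producer.send(new ProducerRecord<>("test-partitions", 2, "some-key", "some-value"));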

In the case of consumers, if you don't "assign" partitions manually, Kafka automatically "assigns" all partitions of the topic to your single consumer instance.

If you have multiple consumers, Kafka tries to balance the connections, i.e. if you have 5 partitions and 2 consumers (with the same consumer group), one consumer may get data from 2 partitions and the other from the remaining 3, or the other way around.

Also, the number of partitions determines the degree of parallelism you can achieve for a single topic with the same consumer group, e.g. data from a 50-partition topic can be processed in parallel by at most 50 consumer instances (one partition each).
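
For illustration, a minimal consumer sketch (the group id is an invented example); because it uses subscribe() with a consumer group, starting a second copy of this program would make Kafka split the 5 partitions between the two instances:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "test-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // subscribe() lets the group coordinator decide which partitions we get
            consumer.subscribe(Collections.singletonList("test-partitions"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records)
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
            }
        }
    }
}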

Replication, on the other hand, has an obvious dependency on the number of available brokers: it makes no sense to have 4 replicas on 3 brokers, because one broker would end up holding 2 copies of the same data.

Replication refers to the entire topic i.e. all partitions are replicated according to the "replication-factor".

So, if you have 5 partitions and replication factor of 3, you essentially have 3 copies of all 5 partitions.
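
You can see this asymmetry with the same CLI as in the question: 5 partitions on 3 brokers is accepted, but asking for a replication factor of 4 is rejected (the topic name here is just an example, and the exact error wording varies by Kafka version):

bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 4 --partitions 5 --topic test-replicas

Error while executing topic command : Replication factor: 4 larger than available brokers: 3.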

Read more here: https://kafka.apache.org/documentation/

Hope this helps :)

Upvotes: 2
