Reputation: 2120
I have 3 brokers set up on my local machine, and I create a topic with 5 partitions,
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 3 --partitions 5 --topic test-partitions
and then describe my test-partitions topic,
bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic test-partitions
which results in
Topic: test-partitions PartitionCount: 5 ReplicationFactor: 3 Configs: segment.bytes=1073741824
Topic: test-partitions Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
Topic: test-partitions Partition: 1 Leader: 0 Replicas: 0,1,2 Isr: 0,1,2
Topic: test-partitions Partition: 2 Leader: 2 Replicas: 2,0,1 Isr: 2,0,1
Topic: test-partitions Partition: 3 Leader: 1 Replicas: 1,0,2 Isr: 1,0,2
Topic: test-partitions Partition: 4 Leader: 0 Replicas: 0,2,1 Isr: 0,2,1
Kafka doesn't raise any error here.
Now, when I use the producer/consumer APIs, everything works fine. What I am not able to understand is: in the configuration, on both the producer and the consumer side, I never define a partition to connect to. My question is: how does Kafka decide which partition to push a message to when I have multiple partitions for the same topic? Isn't this inconsistent behaviour?
Upvotes: 1
Views: 1615
Reputation: 2578
In Kafka, spreading/distributing the data over multiple machines is handled at the level of partitions (not individual records). The scenario is depicted below.
Data and Partitions: the relationship
When a producer sends data, it goes to a topic – but that's the 50,000-foot view. You must understand that
data is actually a key-value pair, and its storage happens at the partition level.
Default behavior
The data for the same key goes to the same partition, since Kafka uses a deterministic hashing algorithm (murmur2 on the serialized key bytes) to map keys to partitions:
hash(key) % number_of_partitions
In case of a null key (yes, that's possible), the data is placed on any of the partitions (older clients spread it round-robin; newer clients use a "sticky" partitioner that batches records to one partition at a time). If you only have one partition, ALL data goes to that single partition.
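As a rough sketch of this default keyed behavior (using a simple stand-in hash, NOT Kafka's actual murmur2, and a made-up key name):

```java
import java.nio.charset.StandardCharsets;

// Simplified sketch of Kafka's default key-based partitioning.
// The real client hashes the serialized key bytes with murmur2;
// a plain polynomial hash is used here purely for illustration.
public class DefaultPartitioningSketch {

    // hash(key) % number_of_partitions
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = 0;
        for (byte b : keyBytes) {
            hash = 31 * hash + b;               // stand-in hash, not murmur2
        }
        return (hash & 0x7fffffff) % numPartitions; // mask keeps result non-negative
    }

    public static void main(String[] args) {
        int numPartitions = 5; // matches the test-partitions topic above
        // The same key always maps to the same partition:
        System.out.println(partitionFor("user-42", numPartitions)
                == partitionFor("user-42", numPartitions)); // prints true
    }
}
```

The point is only that the mapping is deterministic per key, so records with the same key always land on the same partition.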
You can plug in a custom algorithm to partition the data
by implementing the Partitioner interface and configuring the Kafka producer to use it.
Here is an example that picks a random partition (matching the class name):
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class RandomKafkaPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object val, byte[] valBytes, Cluster cluster) {
        // pick any of the topic's partitions at random
        int numPartitions = cluster.partitionCountForTopic(topic);
        return ThreadLocalRandom.current().nextInt(numPartitions);
    }

    @Override
    public void close() {
        // no-op
    }

    @Override
    public void configure(Map<String, ?> map) {
        // no-op
    }
}
https://www.confluent.io/blog/apache-kafka-producer-improvements-sticky-partitioner/
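To make the producer use a custom partitioner, set the standard partitioner.class producer config; the fully-qualified class name below is a placeholder for wherever your implementation lives:

```properties
# producer config (class name is hypothetical)
partitioner.class=com.example.RandomKafkaPartitioner
```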
Upvotes: 1
Reputation: 18525
Kafka doesn't raise any error here.
There is no need to raise an error. I guess you are mixing up partitioning with replication. You can theoretically have as many partitions as you like; only the replication factor is usually tied to the number of brokers.
How does kafka decide which partition to push message to within the same broker when I have multiple partitions for same topic?
If a message has a key, Kafka will hash the key and use the result to map the message to a specific Partition, basically:
hash(key) % number_of_partitions
This means that all messages with the same key will go to the same Partition. If the key is null and the default Partitioner is used, the record will be distributed across Partitions in a round-robin fashion (newer clients use a "sticky" partitioner that fills a batch for one Partition before moving to the next).
However, you may override this behavior and provide your own Partitioning scheme. For example, if you will have many messages with a particular key, you may want one Partition to hold just those messages. This would ensure that all messages with the same key stay in order when consuming them.
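A sketch of the routing logic such a custom scheme could use: one known high-volume key (the key name and partition numbers here are made up) gets a dedicated partition, and every other key is hashed over the remaining ones.

```java
// Self-contained routing sketch; a real implementation would live inside
// a Partitioner's partition() method and get the partition count from Cluster.
public class HotKeyRoutingSketch {

    static final String HOT_KEY = "big-customer";   // hypothetical hot key

    static int route(String key, int numPartitions) {
        if (HOT_KEY.equals(key)) {
            return 0;                               // dedicated partition
        }
        // spread all other keys over partitions 1..numPartitions-1
        return 1 + (key.hashCode() & 0x7fffffff) % (numPartitions - 1);
    }

    public static void main(String[] args) {
        System.out.println(route("big-customer", 5)); // always 0
        System.out.println(route("someone-else", 5)); // somewhere in 1..4
    }
}
```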
Upvotes: 1
Reputation: 846
You are getting confused between two concepts:
Partitioning and Replication
Partitions are simply "Parts" of the same topic and these don't have a dependency on the number of available brokers.
When a producer produces data, it can use round-robin or hashing mechanisms – these are the default strategies, described in the Kafka documentation. The other option is to explicitly specify the partition you want to produce to (if you are logically separating the data).
In case of consumers, if you don't "assign" a partition manually, Kafka automatically "assigns" all partitions to the instance of your consumer.
If you have multiple consumers, Kafka tries to balance the connections i.e. if you have 5 partitions and 2 consumers (With the same consumer group), 1st consumer may get data from 2 partitions and the 2nd consumer will get data from the remaining 3 partitions or the other way around.
Also, the number of partitions determines the degree of parallelism you can achieve for a single topic within the same consumer group. E.g. a 50-partition topic can be processed in parallel by at most 50 consumer instances (1 partition each).
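The balancing described above can be sketched as simple arithmetic: each consumer in the group ends up with either floor(partitions/consumers) or ceil(partitions/consumers) partitions.

```java
// Sketch of how partitions spread over consumers in one group.
public class AssignmentMath {

    static int[] partitionsPerConsumer(int partitions, int consumers) {
        int[] counts = new int[consumers];
        for (int p = 0; p < partitions; p++) {
            counts[p % consumers]++;    // round-robin style spread
        }
        return counts;
    }

    public static void main(String[] args) {
        // 5 partitions, 2 consumers -> one consumer reads 3, the other 2
        int[] counts = partitionsPerConsumer(5, 2);
        System.out.println(counts[0] + "," + counts[1]); // prints "3,2"
    }
}
```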
Replication, on the other hand, has an obvious dependency on the number of available brokers: it is not logical to have a replication factor of 4 on 3 brokers, as one broker would end up holding 2 copies of the same data.
Replication refers to the entire topic i.e. all partitions are replicated according to the "replication-factor".
So, if you have 5 partitions and a replication factor of 3, you essentially have 3 copies of each of the 5 partitions (15 partition replicas in total).
Read more here: https://kafka.apache.org/documentation/
Hope this helps :)
Upvotes: 2