Reputation: 3714
Can anyone explain:
My use case: I am considering sending a lot of ship data to the brokers, keyed by ship_id (the MMSI, if you know it).
The problem is that I don't know how many ships will be received, so I can't define the number of partitions in advance.
Upvotes: 2
Views: 448
Reputation: 3832
1. Kafka messages are key-value pairs stored in topics. Each topic is divided into partitions, and each partition is further divided into segments. Each segment has a log file that stores the actual messages in key-value form, plus an index of message offsets.
The key is optional and is used to choose the partition that stores the message. If the key is null, the partitioner shown below distributes messages round-robin (newer clients, from Kafka 2.4 on, use a "sticky" strategy for null keys instead). If the key is not null, it hashes the key and takes the result modulo the number of partitions, which guarantees that one of the existing partitions is chosen, e.g.
    hash(key) % num_partitions

    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int numPartitions = partitions.size();
        if (keyBytes == null) {
            int nextValue = nextValue(topic);
            List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
            if (availablePartitions.size() > 0) {
                int part = Utils.toPositive(nextValue) % availablePartitions.size();
                return availablePartitions.get(part).partition();
            } else {
                // no partitions are available, give a non-available partition
                return Utils.toPositive(nextValue) % numPartitions;
            }
        } else {
            // hash the keyBytes to choose a partition
            return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        }
    }
Because it uses the modulo, a message is always stored within the range of existing partitions, which is why multiple keys may go to the same partition. The main benefit of a message key is bucketing: messages with the same key always go to the same partition (as long as the number of partitions does not change).
2. So you don't need to define the number of partitions based on the number of keys. As mentioned above, the key is used to bucket messages into partitions according to the default partitioner logic. The number of partitions mainly helps parallelize processing for higher throughput.
Note: Be aware that partitioning by key may cause an uneven distribution of data. If you don't need messages grouped by key, leave the key null so that partitions are selected round-robin.
Another approach is to create a custom partitioner to further refine the partition selection logic.
Upvotes: 0
Reputation: 191743
is it possible that a partition stores messages with multiple keys?
Yes. The murmur2 hash (the algorithm Kafka uses) of two different keys, taken modulo the number of partitions in the topic, can produce the same number. For example, if you have only one partition, every key obviously goes to that partition.
how if the number of key is more than partition available?
The hash is taken modulo the number of partitions, so every key is always assigned a valid partition.
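To make that concrete, here is a minimal self-contained sketch (not Kafka code). It assumes `Arrays.hashCode` as a stand-in for murmur2, so the partition numbers it prints will not match what a real producer picks. The ship ids are made up, and the class and method names are only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

class KeyPartitionDemo {

    // Same shape as toPositive(hash(keyBytes)) % numPartitions.
    // ASSUMPTION: Arrays.hashCode stands in for Kafka's murmur2 here,
    // so results differ from a real Kafka producer.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int positiveHash = Arrays.hashCode(keyBytes) & 0x7fffffff; // clear the sign bit
        return positiveHash % numPartitions;
    }

    // Counts how many of the given keys land on each partition.
    static Map<Integer, Integer> keysPerPartition(String[] keys, int numPartitions) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(partitionFor(key, numPartitions), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical MMSI-like ship ids, invented for this example.
        String[] shipIds = {"200000001", "200000002", "200000003", "200000004",
                            "200000005", "200000006", "200000007"};
        int numPartitions = 3;
        for (String id : shipIds) {
            System.out.println(id + " -> partition " + partitionFor(id, numPartitions));
        }
        System.out.println("keys per partition: " + keysPerPartition(shipIds, numPartitions));
    }
}
```

With seven keys and three partitions, at least one partition has to hold several keys, while each individual key keeps mapping to the same partition on every call.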
Now, if you have a well-defined key, you are guaranteed ordering of messages within each partition (and therefore per key), so the number of partitions really comes down to how much throughput a single partition can handle. There is no short answer: how much data are you sending, and how fast can one consumer read that data from one partition at "peak" consumption? Run appropriate performance tests, then scale the partition count up over new topics to handle potential future load.
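As a rough sketch of that arithmetic: once you have measured the two numbers, the partition count is just the peak rate divided by the per-partition rate, rounded up. The figures below are hypothetical placeholders, not measurements, and the calculation assumes keys spread the load evenly across partitions.

```java
class PartitionSizing {

    // Smallest partition count such that each partition's share of the peak
    // load stays within what one consumer can read from one partition.
    // ASSUMPTION: load is spread evenly across partitions.
    static int partitionsNeeded(double peakMbPerSec, double perPartitionMbPerSec) {
        if (peakMbPerSec < 0 || perPartitionMbPerSec <= 0) {
            throw new IllegalArgumentException("rates must be positive");
        }
        // A topic always needs at least one partition.
        return Math.max(1, (int) Math.ceil(peakMbPerSec / perPartitionMbPerSec));
    }

    public static void main(String[] args) {
        // Hypothetical numbers; replace them with your own test results.
        double peakMbPerSec = 45.0;          // measured peak produce rate
        double perPartitionMbPerSec = 10.0;  // what one consumer sustains on one partition
        System.out.println("partitions needed: "
                + partitionsNeeded(peakMbPerSec, perPartitionMbPerSec));
    }
}
```

You would normally add headroom on top of this figure for future growth.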
You'll also need to consider "hot" / "cold" data. For example, if you had 10 partitions mapped to the first digit of the ID and all of your IDs started with even digits, half of the partitions would end up empty.
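Here is a small sketch of that skew, assuming the hypothetical "first digit of the ID" mapping from the example. The ids are invented, and this is not how Kafka's default partitioner picks a partition.

```java
import java.util.Arrays;

class HotColdDemo {

    // Hypothetical mapping from the example: partition = first digit of the id.
    // ASSUMPTION: every id is non-empty and starts with a digit '0'..'9'.
    static int partitionByFirstDigit(String id) {
        return id.charAt(0) - '0';
    }

    // Counts how many ids land on each partition.
    static int[] countPerPartition(String[] ids, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (String id : ids) {
            counts[partitionByFirstDigit(id) % numPartitions]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up ids that all start with an even digit.
        String[] ids = {"2001", "4417", "6093", "8850", "0312", "2764", "4001", "8123"};
        int[] counts = countPerPartition(ids, 10);
        System.out.println("ids per partition: " + Arrays.toString(counts));
    }
}
```

Every odd-numbered partition stays at zero, so five of the ten partitions carry all the traffic.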
Upvotes: 1