avgosim

Reputation: 3

Kafka topic design: best approach

We are designing an integration that uses Apache Kafka to send critical business data. We have one producer and 5 consumers, so I created one topic with 5 partitions in order to assign one partition to each consumer. However, we need the information delivered in the same order it was sent by the producer, and we were unable to achieve that. I read that Kafka only guarantees order within a partition, so with a single partition I should be able to do it, but since I have 5 consumers I need partitions to parallelize the topic. So I think I must use topic keys, but since order is only guaranteed per partition, I have some questions: if I use keys in the Kafka producer, should I send the payload specifying the partition number (i.e., in the producer code, write the message 5 times, once for each partition)? Or, just by sending keyed data to the topic, does Kafka replicate and write the data in the same order in each partition? Example:

for (int i = 0; i < partitionsNumber; i++) { sendToKafka(i, key, payload); }

In this case, should I use one topic for every consumer instead of partitions?

What is the best strategy to send data in the same order to all consumers?

Note: the only key in the messages is of type String.

Upvotes: 0

Views: 3055

Answers (3)

Chetan Handa

Reputation: 330

Based on the given info, it seems you want to send the same message to 5 consumers in a "fan-out" pattern.

Kafka can only guarantee the ordering of messages within a single partition, so if you create 5 partitions under a topic, the producer will by default distribute the messages across all 5 partitions in a round-robin manner. This explains why you are not getting the messages in the right order.

Based on the given info, it seems you are thinking of adding keys so that messages are sent to a specific partition; this implies that you would send the same message to the broker 5 times, each with a different key. In a way this would trick the system into maintaining the order per partition.
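For reference, explicitly targeting each partition would look something like the sketch below (it assumes an already-configured producer; the topic name and variable names are illustrative):

    // Assumes a configured KafkaProducer<String, byte[]> named "producer".
    // The same payload is written 5 times, once per partition: order holds
    // within each partition, but every message is duplicated.
    for (int partition = 0; partition < 5; partition++) {
        producer.send(new ProducerRecord<>("business-topic", partition, key, payload));
    }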

I'd suggest not using that approach, since you will end up duplicating every message 5 times; instead, you can try a different approach that uses the consumer group's default behavior.

Scenario #1: Try using 1 topic with 1 partition. If you create 5 unique consumer groups with 1 consumer app in each, then each consumer app will be able to read the data from the same topic in parallel.

Scenario #2: If you create 5 consumer apps and place them in the same consumer group, you will not get the required parallelism, since a consumer group allows only 1 consumer app to read from 1 partition at a time, so the remaining 4 will be idle.

Scenario #3: You could create 5 topics with 1 partition each and hook up 5 consumer apps; you would get parallelism, but at the expense of duplicating the data.

So scenario #1 may work best for you based on the information you have provided; a sketch of that setup is below.
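For illustration, a minimal sketch of the scenario #1 setup in Java (the topic name, broker address, and replication factor are assumptions):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateOrderedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // A single partition preserves total ordering; replication
                // factor 1 is only suitable for a local test.
                NewTopic topic = new NewTopic("business-events", 1, (short) 1);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
            // Each of the 5 consumer apps then subscribes to "business-events"
            // with its own unique group.id, so every app receives every message
            // in the order it was produced.
        }
    }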

Upvotes: 0

Simran Singh

Reputation: 52

I wasn't able to add this as a comment because it is quite long.

What you have mentioned in your comment, that "we need an equal number of partitions for consumer application", is correct. However, it only applies if all the consumers (in your case, 5) are in the same consumer group.

For example, suppose a topic T has 5 partitions, and we create a consumer c1 in consumer group G1. Consumer c1 will get messages from all 5 partitions of topic T. Then we add consumer c2 to the same consumer group G1: c1 will consume from 3 partitions and c2 from the remaining 2 (or vice versa). The "one partition per consumer application" setup you mentioned is the ideal scenario here, where 5 consumers in the same consumer group (G1) consume from all 5 partitions in parallel. This concept is called scalability.

Now, in your case you need the same data to be read 5 times because you have 5 consumers. Instead of publishing the same messages to 5 partitions and then consuming them from all 5 consumers, you can write a simple producer app that publishes the data to a topic with 1 partition. Then your 5 consumer apps can consume the same data independently, i.e., assign each consumer application a random consumer-group name so that it consumes the messages independently (and commits its own offsets).

Below is a code snippet showing two consumers consuming messages from the same topic (1 partition) in parallel:

Consumer 1:

    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, UUID.randomUUID().toString()); // randomise consumer group for consumer 1
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

    KafkaConsumer<String, byte[]> consumerLiveVideo = new KafkaConsumer<>(props);
    consumerLiveVideo.subscribe(Collections.singletonList(topicName[0])); // topic with 1 partition

Consumer 2:

    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, UUID.randomUUID().toString()); // randomise consumer group for consumer 2
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

    KafkaConsumer<String, byte[]> consumerLiveVideo = new KafkaConsumer<>(props);
    consumerLiveVideo.subscribe(Collections.singletonList(topicName[0])); // topic with 1 partition
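Neither snippet shows the read itself; as a sketch, either consumer could follow the subscribe call with a standard poll loop (the record handling here is only an example):

    // Requires org.apache.kafka.clients.consumer.ConsumerRecord and
    // ConsumerRecords, plus java.time.Duration. Records arrive in offset
    // order because the topic has a single partition.
    while (true) {
        ConsumerRecords<String, byte[]> records = consumerLiveVideo.poll(Duration.ofMillis(100));
        for (ConsumerRecord<String, byte[]> record : records) {
            System.out.printf("offset=%d key=%s%n", record.offset(), record.key());
        }
    }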

You also asked about the correct approach; in my opinion, a single consumer application is all you need. Also, don't mix up the concepts of replication and scalability in Kafka, as both are critical.

Also, since you mentioned the data is critical, you can read about the producer configuration parameter acks (use acks=1 or acks=all depending on your scenario).
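For illustration, a minimal producer sketch with acks set (the topic name, broker address, payload, and serializers are assumptions):

    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArraySerializer");
    props.put(ProducerConfig.ACKS_CONFIG, "all"); // "1" = leader ack only; "all" = all in-sync replicas

    KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);
    byte[] payload = "critical business event".getBytes();
    producer.send(new ProducerRecord<>("business-events", "some-key", payload)); // 1-partition topic keeps order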

For more details about scalability, replication, consumer groups, and consumers/producers/brokers/topics, please go through chapters 1-5 of Kafka: The Definitive Guide.

Upvotes: 2

Simran Singh

Reputation: 52

You need all your consumers to read the same messages published by the producer, right?

If that's the case, you don't have to publish/produce the same messages to all 5 partitions of your topic.

A simpler approach would be to create a topic with 1 partition and have your producer app publish all the messages to that topic/partition.

Now you can easily create consumer applications with different consumer groups consuming data from the same topic. Assign a random group id to each consumer; this way all 5 consumers can consume from the one topic/partition and still commit their offsets.

Just add the code snippet below to the properties of all 5 consumer apps.

props.put(ConsumerConfig.GROUP_ID_CONFIG, UUID.randomUUID().toString()); // randomise consumer group.

Let me know if you have any questions.

Upvotes: 0
