prateek jangid
prateek jangid

Reputation: 85

Kafka listener concurrency threads taking time to start in parallel?

I need to process around 50k records(this number can vary from 100 to 50k max) in same topic. Therefore, i used concurrency feature of kafka.Below is my configuration and listener code.

@KafkaListener(topics = {"kafkaTopic"},
                containerFactory = "abcd")
        public void consume(
                @Payload List<String> message,
                @Header(KafkaHeaders.RECEIVED_TOPIC) String topic
        ) throws IOException {
    
            StopWatch st = new StopWatch();
     DateFormat dateFormat = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss");
            Date date = new Date();
            StringBuilder str = new StringBuilder();
            st.start("threadName-");
            message.forEach(messages -> {
                try {
                    Thread.sleep(2500);
                    logger.info("message is-{}", messages);
                    str.append(messages);
                    str.append(",");
                } catch (Exception e) {
                    str.append("exception-{}" + e);
                }
            });
    
            st.stop();
           
            List data = objectMapper.readValue(getFile(), new TypeReference<List<String>>() {});
    
            str.append("----thread-" + Thread.currentThread().getName() + "started at time-"+dateFormat.format(date)+" and time taken-" + String.format("%.2f", st.getTotalTimeSeconds()));
            str.append("---");
            data.add(str);
            objectMapper.writeValue(getFile(),
                    data);
    
        }


    @Bean("abcd")
        public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
            ConcurrentKafkaListenerContainerFactory<String, String> factory = new ConcurrentKafkaListenerContainerFactory<>();
            factory.setConsumerFactory(consumerFactory());
            factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.BATCH);
            factory.setConcurrency(5);
            factory.setBatchListener(true);
            return factory;
        }
    
        @Bean
        public NewTopic syliusDeTopic() {
            return TopicBuilder.name("kafkaTopic").partitions(5).replicas(2).build();
        }

@Bean
    public ConsumerFactory<String, String> consumerFactory() {
        Map<String, Object> configProps = new HashMap<>();
        configProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, server);
        configProps.put(ConsumerConfig.GROUP_ID_CONFIG, consumerGroupId);
        configProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        configProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        configProps.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, CustomCooperativeStickyAssignor.class.getName());
        configProps.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG,"500");
        configProps.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG,"1");
        configProps.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG,"5000");
        return new DefaultKafkaConsumerFactory<>(configProps);
    }

But when i checked the result for sample 100 records, threads did not started at same time. Below is the response for the same.

["test-0,test-1,test-2,----thread-org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1started at time-2023/01/06 22:20:19 and time taken-7.51---","test-56,test-57,test-58,test-59,test-60,test-61,----thread-org.springframework.kafka.KafkaListenerEndpointContainer#0-1-C-1started at time-2023/01/06 22:20:26 and time taken-15.02---","test-70,test-71,test-72,test-73,test-74,test-75,test-76,test-77,test-78,----thread-org.springframework.kafka.KafkaListenerEndpointContainer#0-3-C-1started at time-2023/01/06 22:20:34 and time taken-22.53---","test-62,test-63,test-64,test-65,test-66,test-67,test-68,test-69,test-85,test-86,test-87,test-88,test-89,test-90,test-91,----thread-org.springframework.kafka.KafkaListenerEndpointContainer#0-2-C-1started at time-2023/01/06 22:20:49 and time taken-37.55---","test-79,test-80,test-81,test-82,test-83,test-84,test-92,test-93,test-94,test-95,test-96,test-97,test-98,test-99,----thread-org.springframework.kafka.KafkaListenerEndpointContainer#0-1-C-1started at time-2023/01/06 22:21:01 and time taken-35.05---","test-3,test-4,test-5,test-6,test-7,test-8,test-9,test-10,test-11,test-12,test-13,test-14,test-15,test-16,test-17,test-18,test-19,test-20,test-21,test-22,test-23,test-24,test-25,test-26,test-27,test-28,test-29,test-30,test-31,test-32,test-33,test-34,test-35,test-36,test-37,test-38,test-39,test-40,test-41,test-42,test-43,test-44,test-45,test-46,test-47,test-48,test-49,test-50,test-51,test-52,test-53,test-54,test-55,----thread-org.springframework.kafka.KafkaListenerEndpointContainer#0-4-C-1started at time-2023/01/06 22:22:24 and time taken-132.69---"]

The started time of threads are different with difference of around >80 secs between first thread and last.

Any idea how to resolve this.I want the thread to run at almost same time (thread count can increase to max 15) which can improve the ingestion of huge records?

Also, data added in partition of varied size.Can it be resolved as well?

Upvotes: 0

Views: 582

Answers (1)

Artem Bilan
Artem Bilan

Reputation: 121177

Your started at time assumption is not correct. That's really not a time when concurrent containers are started, but rather the time when they consume records from partitions assigned to them. So, you may just not have any data in the partition to consume at the moment. More over there is no guarantee that all the partitions are assigned to consumers at the same time. So, one consumer may have already started to consume, but other has not obtained assigned to them partitions yet.

To evenly distribute data between partitions you need to look into a messageKey and Partitioner abstraction on the producer side.

Upvotes: 1

Related Questions