Reputation: 371
There are 32 Kafka partitions and 32 consumers, as per the direct approach. But the 32 consumers process data more slowly than Kafka receives it (incoming data is roughly 1.5x the processing rate), which creates a backlog of data in Kafka.
I want to increase the number of partitions in the DStream received by each consumer.
I would like the solution to be something along the lines of increasing partitions on the consumer side rather than increasing partitions in Kafka.
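For reference, a minimal sketch of the setup described above (assuming the spark-streaming-kafka-0-10 direct stream API; broker, topic, and group names are placeholders):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val conf = new SparkConf().setAppName("backlogged-job")  // placeholder name
    val ssc = new StreamingContext(conf, Seconds(10))        // example batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",                  // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-consumer-group"                      // placeholder
    )

    // Direct stream: each batch RDD gets exactly one partition per Kafka
    // partition, so 32 Kafka partitions -> 32 processing tasks per batch.
    // This is the parallelism I want to increase.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))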
Upvotes: 0
Views: 669
Reputation: 368
In the direct stream approach, you can have at most #consumers = #partitions; Kafka does not allow more than one consumer per partition per group.id. You are asking for more partitions per consumer, but that will not help: your consumers are already running at full capacity and are still falling behind.
A few technical changes you can try to reduce the data backlog in Kafka:
Increase the number of partitions - although you do not want to do this, it is still the easiest approach. Sometimes the platform just needs more hardware.
Optimize processing on the consumer side - check the possibility of de-duplicating records before processing, reduce disk I/O, apply loop-unrolling techniques, etc., to reduce the time taken by the consumers.
(Higher difficulty) Controlled data distribution - it is often found that some partitions process data faster than others, and it may be worth checking whether this is the case on your platform. Kafka's data distribution policy has some preferences (as does the message key), which often cause uneven load inside the cluster (see the partitioner sketch after this list): https://www.cloudera.com/documentation/kafka/latest/topics/kafka_performance.html
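As a hypothetical illustration of the last point, here is a producer-side Partitioner that ignores the message key and spreads records round-robin, so no single partition (and hence no single consumer) receives a disproportionate share. This is a sketch against the plain Java Kafka producer API, not code from your setup:

    import java.util.concurrent.atomic.AtomicInteger
    import org.apache.kafka.clients.producer.Partitioner
    import org.apache.kafka.common.Cluster

    class RoundRobinPartitioner extends Partitioner {
      private val counter = new AtomicInteger(0)

      // Cycle through all partitions of the topic, ignoring the record key,
      // so the load is spread evenly across consumers.
      override def partition(topic: String, key: AnyRef, keyBytes: Array[Byte],
                             value: AnyRef, valueBytes: Array[Byte],
                             cluster: Cluster): Int = {
        val numPartitions = cluster.partitionsForTopic(topic).size
        (counter.getAndIncrement() & Int.MaxValue) % numPartitions
      }

      override def configure(configs: java.util.Map[String, _]): Unit = ()
      override def close(): Unit = ()
    }

    // Registered on the producer with:
    //   props.put("partitioner.class", classOf[RoundRobinPartitioner].getName)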
Upvotes: 0
Reputation: 39
Assuming you have enough hardware resources allocated to the consumers, you can check the parameter below:
    spark.streaming.kafka.maxRatePerPartition
It sets the maximum number of records consumed from a single Kafka partition per second.
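A minimal sketch of setting it (the numbers are illustrative; with 32 partitions, a 10-second batch interval, and a cap of 1000 records/second/partition, each batch is bounded at 32 * 1000 * 10 = 320,000 records):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("rate-limited-job")                           // placeholder
      // Cap how fast the direct stream reads from each Kafka partition:
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")
      // Optionally let Spark adapt the ingestion rate to processing speed:
      .set("spark.streaming.backpressure.enabled", "true")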
Upvotes: 0