Reputation: 371
There are 32 Kafka partitions and 32 consumers, as per the direct approach. But the 32 consumers process data more slowly than Kafka receives it (incoming data is roughly 1.5x the processing rate), which creates a backlog of data in Kafka.
I want to increase the number of partitions in the DStream received by each consumer.
I would like the solution to be something along the lines of increasing partitions on the consumer side rather than increasing partitions in Kafka.
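For reference, a minimal sketch of the setup described above (assuming the spark-streaming-kafka-0-10 direct stream API; broker, topic, and group names are placeholders):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val conf = new SparkConf().setAppName("backlogged-job")  // placeholder name
    val ssc = new StreamingContext(conf, Seconds(10))        // example batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",                  // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-consumer-group"                      // placeholder
    )

    // Direct stream: each batch RDD gets exactly one partition per Kafka
    // partition, so 32 Kafka partitions -> 32 processing tasks per batch.
    // This is the parallelism I want to increase.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))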
Upvotes: 0
Views: 669
Reputation: 368
In the direct stream approach, you can have at most #consumers = #partitions; Kafka does not allow more than one consumer per partition per group.id. You are asking for more partitions per consumer, but that will not help: your consumers are already running at full capacity and are still falling behind.
A few technical changes you can try to reduce the data backlog in Kafka:
Increase the number of partitions - although you do not want to do this, it is still the easiest approach. Sometimes the platform just needs more hardware.
Optimize processing on the consumer side - check the possibility of de-duplicating records before processing, reduce disk I/O, apply loop-unrolling techniques, etc., to reduce the time taken by the consumers.
(Higher difficulty) Controlled data distribution - it is often found that some partitions process data faster than others, and it may be worth checking whether this is the case on your platform. Kafka's data distribution policy has some preferences (as does the message key), which often cause uneven load inside the cluster (see the partitioner sketch after this list): https://www.cloudera.com/documentation/kafka/latest/topics/kafka_performance.html
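As a hypothetical illustration of the last point, here is a producer-side Partitioner that ignores the message key and spreads records round-robin, so no single partition (and hence no single consumer) receives a disproportionate share. This is a sketch against the plain Java Kafka producer API, not code from your setup:

    import java.util.concurrent.atomic.AtomicInteger
    import org.apache.kafka.clients.producer.Partitioner
    import org.apache.kafka.common.Cluster

    class RoundRobinPartitioner extends Partitioner {
      private val counter = new AtomicInteger(0)

      // Cycle through all partitions of the topic, ignoring the record key,
      // so the load is spread evenly across consumers.
      override def partition(topic: String, key: AnyRef, keyBytes: Array[Byte],
                             value: AnyRef, valueBytes: Array[Byte],
                             cluster: Cluster): Int = {
        val numPartitions = cluster.partitionsForTopic(topic).size
        (counter.getAndIncrement() & Int.MaxValue) % numPartitions
      }

      override def configure(configs: java.util.Map[String, _]): Unit = ()
      override def close(): Unit = ()
    }

    // Registered on the producer with:
    //   props.put("partitioner.class", classOf[RoundRobinPartitioner].getName)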
Upvotes: 0
Reputation: 39
Assuming you have enough hardware resources allocated to the consumers, you can check the parameter below:
    spark.streaming.kafka.maxRatePerPartition
It sets the maximum number of records consumed from a single Kafka partition per second.
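A minimal sketch of setting it (the numbers are illustrative; with 32 partitions, a 10-second batch interval, and a cap of 1000 records/second/partition, each batch is bounded at 32 * 1000 * 10 = 320,000 records):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("rate-limited-job")                           // placeholder
      // Cap how fast the direct stream reads from each Kafka partition:
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")
      // Optionally let Spark adapt the ingestion rate to processing speed:
      .set("spark.streaming.backpressure.enabled", "true")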
Upvotes: 0