Vinayak Mishra

Reputation: 371

Increase number of partitions in DStream to be greater than Kafka partitions in Direct approach

There are 32 Kafka partitions and 32 consumers, as per the Direct approach. But processing by the 32 consumers is slower than the rate at which data arrives in Kafka (about 1.5x), which creates a backlog of data in Kafka.

I want to increase the number of partitions in the DStream received by each consumer.

I would like the solution to involve increasing partitions on the consumer side rather than increasing partitions in Kafka.
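
For reference, the kind of DStream-level change I am asking about would look something like the sketch below, assuming the 0.8 direct API (broker and topic names are placeholders, and repartition adds a shuffle):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("direct-stream-repartition")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder broker list and topic name.
    val kafkaParams = Map[String, String]("metadata.broker.list" -> "broker1:9092")
    val topics = Set("events")

    // Direct approach: the DStream gets exactly one RDD partition per
    // Kafka partition, so 32 Kafka partitions -> 32 tasks per batch.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Shuffle to more partitions so downstream processing runs with more
    // tasks; consumption itself still happens with 32 tasks.
    val widened = stream.repartition(64)

    widened.foreachRDD(rdd => rdd.foreach(println)) // stand-in for real processing

    ssc.start()
    ssc.awaitTermination()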

Upvotes: 0

Views: 669

Answers (2)

AbhishekN

Reputation: 368

In the direct stream approach, you can have at most #consumers = #partitions: Kafka does not allow more than one consumer per partition per group.id. As for more partitions per consumer: it will not help, since your consumers are already running at full capacity and are still falling behind.

A few technical changes you can try to reduce the data backlog in Kafka:

  1. Increase the number of partitions - although you do not want to do this, it is still the easiest approach. Sometimes the platform just needs more hardware.

  2. Optimize processing on the consumer side - check the possibility of de-duplicating records before processing, reduce disk I/O, apply loop-unrolling techniques, etc. to reduce the time taken by the consumers.

  3. (higher difficulty) Controlled data distribution - it is often the case that some partitions process better than others. It may be worth checking whether this happens on your platform. Kafka's data distribution policy has some preferences (as does the message key), which often cause uneven load inside the cluster; see the producer sketch after this list: https://www.cloudera.com/documentation/kafka/latest/topics/kafka_performance.html
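
A minimal producer-side sketch of point 3, assuming the plain Kafka Java producer API (broker and topic names are placeholders). With a keyed message, the default partitioner hashes the key to choose a partition, so skewed keys mean skewed partitions:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // placeholder broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // The default partitioner hashes the key, so a few hot keys can load
    // a few partitions while others sit idle. A rotating synthetic key
    // (or a null key, which the producer distributes itself) spreads
    // records evenly across all 32 partitions.
    for (i <- 0 until 1000) {
      val syntheticKey = (i % 32).toString // 32 = partition count of the topic
      producer.send(new ProducerRecord[String, String]("events", syntheticKey, s"payload-$i"))
    }

    producer.close()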

Upvotes: 0

Sachin

Reputation: 39

Assuming you have enough hardware resources allocated to the consumers, you can check the parameter below:

spark.streaming.kafka.maxRatePerPartition

It sets the maximum number of records consumed from a single Kafka partition per second.
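
A minimal sketch of setting it (the 500 records/second figure is only an illustration; tune it to your measured throughput):

    import org.apache.spark.SparkConf

    // Caps each Kafka partition at 500 records/second per batch interval.
    // With 32 partitions and a 10-second batch, that is at most
    // 32 * 500 * 10 = 160,000 records per batch.
    val conf = new SparkConf()
      .setAppName("rate-limited-stream")
      .set("spark.streaming.kafka.maxRatePerPartition", "500")
      // Related setting: lets Spark adapt the ingest rate to processing speed.
      .set("spark.streaming.backpressure.enabled", "true")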

Upvotes: 0
