Krever

Reputation: 1467

Spark Streaming + Kafka throughput

In my Spark application I'm reading from a Kafka topic. The topic has 10 partitions, so I've created 10 receivers with one thread per receiver. With this configuration I observe weird behavior of the receivers. The median rates for these consumers are:

Receiver-0 node-1 10K
Receiver-1 node-2 2.5K
Receiver-2 node-3 2.5K
Receiver-3 node-4 2.5K
Receiver-4 node-5 2.5K
Receiver-5 node-1 10K
Receiver-6 node-2 2.6K
Receiver-7 node-3 2.5K
Receiver-8 node-4 2.5K
Receiver-9 node-5 2.5K

Problem 1: node-1 is receiving as many messages as the other four nodes together.

Problem 2: The app is not reaching its batch performance limit (30-second batches are computed in a median time of 17 seconds). I would like it to consume enough messages to reach at least 25 seconds of computation time.

Where should I look for the bottleneck?

To be clear, there are more messages to be consumed.
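For context, a receiver-based setup like the one described above might look roughly like the sketch below (a minimal Scala example; the ZooKeeper quorum, consumer group id, and topic name are placeholders, not taken from the question):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-receivers-sketch")
    val ssc  = new StreamingContext(conf, Seconds(30))

    // One receiver per Kafka partition, each with a single consumer thread.
    val streams = (1 to 10).map { _ =>
      KafkaUtils.createStream(
        ssc,
        "zk-host:2181",        // ZooKeeper quorum (placeholder)
        "my-consumer-group",   // consumer group id (placeholder)
        Map("my-topic" -> 1)   // topic -> number of consumer threads per receiver
      )
    }

    // Union the 10 receiver streams so downstream processing sees one DStream.
    val messages = ssc.union(streams)
    messages.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}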

Edit: I had lag on only two partitions, so the first problem is solved. Still, reading 10K messages per second is not very much.

Upvotes: 3

Views: 1229

Answers (1)

Timomo

Reputation: 176

Use Spark's built-in backpressure (available since Spark 1.5, which wasn't released at the time of your question): https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-streaming-backpressure.adoc

Just set:

spark.streaming.backpressure.enabled=true
spark.streaming.kafka.maxRatePerPartition=X   (with X set really high in your case)
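For example, the two properties can be set on the SparkConf when the streaming context is built (a minimal Scala sketch; the app name, batch interval, and the concrete rate value are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Enable backpressure and cap the per-partition read rate.
val conf = new SparkConf()
  .setAppName("kafka-throughput-sketch")                   // placeholder app name
  .set("spark.streaming.backpressure.enabled", "true")
  // Upper bound of records per partition per second; pick a deliberately
  // high value so backpressure, not this cap, governs the actual rate.
  .set("spark.streaming.kafka.maxRatePerPartition", "100000")

val ssc = new StreamingContext(conf, Seconds(30))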

To find the bottleneck, use the Spark Streaming web UI and look at the DAG of the stage that takes the most time.

Upvotes: 1
