Reputation: 309
Scenario:
Kafka -> Spark Streaming
Logic in each Spark Streaming microbatch (30 seconds):
Read JSON -> Parse JSON -> Send to Kafka
My streaming job is reading from around 1000 Kafka topics, with around 10K Kafka partitions, the throughput was around 5 million events/s.
The issue comes from an uneven traffic load across Kafka partitions: some partitions have about 50 times the throughput of the smaller ones. This results in skewed RDD partitions (as KafkaUtils creates a 1:1 mapping from Kafka partitions to Spark partitions) and really hurts overall performance, because in each microbatch most executors end up waiting for the one with the largest load to finish. I can see this in the Spark UI: at some point in each microbatch, only a few executors have "ACTIVE" tasks while all the others have finished their tasks and are waiting. The task time distribution confirms it: MAX is 2.5 minutes but MEDIAN is only 20 seconds.
How do I make workload more evenly distributed between Spark executors so that resources are being used more efficiently? And performance would be better?
Upvotes: 1
Views: 1936
Reputation: 1129
I have the same issue.
You can try the minPartitions option (I used it on Spark 2.4.7). A few things are important to highlight: using minPartitions and maxOffsetsPerTrigger together, you can pre-calculate a good number of partitions.
.option("minPartitions", partitionsNumberLoadedFromKafkaAdminAPI * splitPartitionFactor)
.option("maxOffsetsPerTrigger", maxEventsPerPartition * partitionsNumber)
maxEventsPerPartition and splitPartitionFactor are defined in config.
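Putting the two options together, the pre-calculation might look like the sketch below. All concrete numbers are assumptions for illustration; in practice partitions_number would be loaded from the Kafka Admin API, and the two factors from config, as the answer describes:

```python
# Sketch of the pre-calculation described above. All numbers are hypothetical;
# partitions_number would normally come from the Kafka Admin API.
partitions_number = 10_000         # Kafka partitions (figure from the question)
split_partition_factor = 4         # assumed config value
max_events_per_partition = 15_000  # assumed config value, per partition per trigger

# Target Spark partition count and per-trigger offset cap.
min_partitions = partitions_number * split_partition_factor          # 40000
max_offsets_per_trigger = max_events_per_partition * partitions_number  # 150000000

# These values would then be passed to the Structured Streaming reader, e.g.:
# spark.readStream.format("kafka") \
#      .option("minPartitions", min_partitions) \
#      .option("maxOffsetsPerTrigger", max_offsets_per_trigger)
```

With minPartitions above the Kafka partition count, the source splits large Kafka partitions into several Spark partitions, which is exactly what counters the skew described in the question.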
In my case, I sometimes get data spikes and my message sizes can vary a lot, so I implemented my own streaming Source that can split Kafka partitions by exact record size and even coalesce a few Kafka partitions onto one Spark partition.
Upvotes: 2
Reputation: 18108
Actually, you have provided your own answer.
Do not have one Streaming Job reading from 1000 topics. Put the topics with the biggest load into separate Streaming Job(s) and reconfigure; it's that simple. It's basic load balancing / queuing theory.
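One way to decide the split is to rank topics by observed throughput and give each job its own slice of the topic list. A minimal sketch, where the topic names, rates, and threshold are all hypothetical:

```python
# Hypothetical per-topic event rates (events/s), e.g. gathered from Kafka metrics.
topic_rates = {
    "orders": 2_000_000,
    "clicks": 1_500_000,
    "audit": 5_000,
    "heartbeats": 1_000,
}

HOT_THRESHOLD = 100_000  # assumed cut-off between "hot" and "normal" topics

# Partition the topic list into a hot group and a normal group.
hot = sorted(t for t, r in topic_rates.items() if r >= HOT_THRESHOLD)
normal = sorted(t for t, r in topic_rates.items() if r < HOT_THRESHOLD)
# hot    -> ['clicks', 'orders']
# normal -> ['audit', 'heartbeats']

# Each group would then be read by its own streaming job, e.g.:
# spark.readStream.format("kafka").option("subscribe", ",".join(hot)) ...
# spark.readStream.format("kafka").option("subscribe", ",".join(normal)) ...
```

The hot job can then be sized (executors, trigger interval) independently of the long tail of quiet topics, so its stragglers no longer stall everything else.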
Stragglers are a known issue in Spark, although a straggler takes on a slightly different character in a Spark context.
Upvotes: 0