Reputation: 123
I have two large JSON files that we stream over Kafka: one file is pushed into partition 0 of Topic1 and the other into partition 1 of Topic1. We use a Spark streaming query with a watermark to join these files and perform the necessary calculations. Even though we are only joining the files and performing simple calculations, the Spark UI shows that more than 200 tasks are executed by the Spark engine, taking more than 6 minutes. These stats are from a box with 2 cores and 8 GB of RAM.
Below are a few questions we have: 1) Why are there so many tasks for these simple operations? 2) Does a large JSON file get split among multiple executors? As per my understanding, it's not possible to perform operations on a split part of a JSON document; it has to be processed on one executor. Does that mean we cannot split large XML or JSON among multiple executors to increase parallelism?
Thanks
Upvotes: 2
Views: 775
Reputation: 16086
It's all about partitions:
200 is the default value of the Spark shuffle partitions parameter, which defines the number of partitions created after a shuffle. In your case, the join is causing a shuffle. You can change it via spark.sql.shuffle.partitions
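For example, a sketch of lowering the setting to match your small box (the value 2 here is only illustrative, chosen to match the 2 cores; tune it to your data volume):

```scala
import org.apache.spark.sql.SparkSession

// Option 1: set it when building the session
val spark = SparkSession.builder()
  .appName("stream-join")
  .config("spark.sql.shuffle.partitions", "2") // default is 200
  .getOrCreate()

// Option 2: change it on an already-running session
spark.conf.set("spark.sql.shuffle.partitions", "2")
```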
With the Kafka source, the number of partitions in Spark = the number of partitions in Kafka (on master there is a merged PR that can set the number of Spark partitions to x * Kafka partitions, where you define x - it's not released yet AFAIR)
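So in your case, reading Topic1 (2 Kafka partitions) produces a streaming DataFrame with only 2 Spark partitions. A sketch, assuming a hypothetical broker at localhost:9092; until the x-multiplier option is released, repartition() is the usual way to fan work out across more tasks:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-read").getOrCreate()

// One Spark partition per Kafka partition of Topic1 (2 here)
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
  .option("subscribe", "Topic1")
  .load()

// Fan out to more partitions before the join (adds a shuffle)
val fannedOut = df.repartition(4)
```

Note that repartition() itself triggers a shuffle, so whether it helps depends on whether the extra parallelism outweighs the shuffle cost on a 2-core machine.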
Upvotes: 1