Reputation: 11
I run 4 parallel threads on the Spark driver that do the same work, each on different data. Spark executes all submitted jobs in parallel up to the point where a join occurs; from there the join actions run sequentially.
This is what the Spark UI shows:
Is there anything I can do to make the joins run in parallel?
The command I use to start the process is:

```
spark-submit \
  --master local[16] \
  --class ... \
  --driver-memory 11G \
  --conf spark.default.parallelism=4 \
  --conf spark.sql.shuffle.partitions=4
```
I use only 4 partitions because the data I process is very small (2-3 MB).
For the time being I am testing in local mode; for production I will use an EMR cluster.
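For reference, the driver-side submission pattern described above can be sketched in plain Python. Here `run_pipeline` is a hypothetical stand-in for the real Spark job (build DataFrames, join, trigger an action); Spark's scheduler is thread-safe, so each driver thread can trigger actions on the shared session concurrently. This mockup only illustrates the threading pattern, not Spark itself.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(data):
    # In the real job this would join DataFrames and trigger an
    # action such as count() or write(); here we just sum as a
    # placeholder for the action's result.
    return sum(data)

# Four small datasets, mirroring the 4 parallel driver threads.
datasets = [[1, 2], [3, 4], [5, 6], [7, 8]]

# One thread per dataset; each thread submits its own "job".
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_pipeline, datasets))

print(results)  # [3, 7, 11, 15]
```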
Upvotes: 0
Views: 351
Reputation: 11
The problem was that I was persisting the data immediately after the join. After removing the persist call, the joins ran in parallel.
Upvotes: 1