Reputation: 11
I run 4 parallel threads on the Spark driver that do the same work, each on different data. Spark executes all submitted jobs in parallel up to the point where a join occurs; from there the join actions run sequentially.
This is what the Spark UI shows:
Is there anything I can do to make the joins run in parallel?
The command I use to start the process is:

```
spark-submit \
  --master local[16] \
  --class ... \
  --driver-memory 11G \
  --conf spark.default.parallelism=4 \
  --conf spark.sql.shuffle.partitions=4
```
I use only 4 partitions because the data I process is very small (2-3 MB).
For the time being I am testing in local mode; for production I will use an EMR cluster.
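For reference, the driver-side submission pattern described above can be sketched in plain Python. Here `run_pipeline` is a hypothetical stand-in for the real Spark job (build DataFrames, join, trigger an action); Spark's scheduler is thread-safe, so each driver thread can trigger actions on the shared session concurrently. This mockup only illustrates the threading pattern, not Spark itself.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(data):
    # In the real job this would join DataFrames and trigger an
    # action such as count() or write(); here we just sum as a
    # placeholder for the action's result.
    return sum(data)

# Four small datasets, mirroring the 4 parallel driver threads.
datasets = [[1, 2], [3, 4], [5, 6], [7, 8]]

# One thread per dataset; each thread submits its own "job".
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_pipeline, datasets))

print(results)  # [3, 7, 11, 15]
```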
Upvotes: 0
Views: 351
Reputation: 11
The problem was that I was persisting the data immediately after the join. After removing the persist call, the joins ran in parallel.
Upvotes: 1