Flavius Dumitrascu
Flavius Dumitrascu

Reputation: 11

Spark executes joins sequentially even though they are submitted in parallel threads

I run 4 parallel threads on the driver node in Spark that do the same thing but with different data. Spark does execute all the submitted jobs in parallel up to the point where there is a join. At that point the join actions are done sequentially. This is what the SparkUI shows: enter image description here

Is there anything i can do to make the joins run in parallel?

UPDATE:

The command I use to start the process is:

spark-submit  
   --master local[16]  
   --class ...  
   --driver-memory 11G  
   --conf spark.default.parallelism=4  
   --conf spark.sql.shuffle.partitions=4

I use only 4 partitions because the data I process is very small (2-3MB).
For the time being I am testing in local mode. For production I will use an EMR cluster.

Upvotes: 0

Views: 351

Answers (1)

Flavius Dumitrascu
Flavius Dumitrascu

Reputation: 11

The problem was that I was persisting the data immediately after the join. After removing the persist, the joins were done in parallel.

Upvotes: 1

Related Questions