Cosmin

Reputation: 716

How to launch parallel Spark jobs?

I don't think I understand well enough how to launch jobs.

I have one job which takes 60 seconds to finish. I run it with the following command:

spark-submit --executor-cores 1 \
             --executor-memory 1g \
             --driver-memory 1g \
             --master yarn \
             --deploy-mode cluster \
             --conf spark.dynamicAllocation.enabled=true \
             --conf spark.shuffle.service.enabled=true \
             --conf spark.dynamicAllocation.minExecutors=1 \
             --conf spark.dynamicAllocation.maxExecutors=4 \
             --conf spark.dynamicAllocation.initialExecutors=4 \
             --conf spark.executor.instances=4

If I increase the number of partitions in my code and the number of executors, the app finishes faster, which is what I expect. But if I increase only executor-cores, the finish time stays the same, and I don't understand why. I expect the time to be lower than the initial time.
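
To be concrete, by "increasing the number of partitions from code" I mean something like this simplified sketch (the input data and the transformation are placeholders for my real job):

import org.apache.spark.sql.SparkSession

object MyJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("my-job").getOrCreate()

    // Placeholder input; the real job reads actual data here
    val data = spark.sparkContext.parallelize(1 to 1000000)

    // This is the knob I change: the number of partitions
    val repartitioned = data.repartition(16)

    println(s"partitions: ${repartitioned.getNumPartitions}")
    println(repartitioned.map(_ * 2).sum())

    spark.stop()
  }
}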

My second problem: if I launch the above command twice, I expect both jobs to finish in 60 seconds, but that doesn't happen. Both jobs finish after 120 seconds, and I don't understand why.

I run this code on AWS EMR, on 2 instances (4 CPUs each, and each CPU has 2 threads). From what I saw in the default EMR configuration, YARN is set to FIFO (default) mode with the CapacityScheduler.

What do you think about these problems?

Upvotes: 0

Views: 827

Answers (1)

Assaf Mendelson

Reputation: 13001

Spark creates partitions based on logic inside the data source. In your case it probably creates fewer partitions than executors * executor cores, so just increasing the number of cores will not make the job run faster, as the extra cores would be idle. When you also increase the number of partitions, it can run faster.
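
As a quick sanity check, compare the partition count to the total number of cores (a rough sketch; the S3 path is a placeholder, and how many partitions the source creates depends on your data source):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Tasks run one per core, so if partitions < executors * cores,
// the extra cores sit idle.
val rdd = spark.sparkContext.textFile("s3://my-bucket/my-data") // placeholder path
println(s"partitions created by the source: ${rdd.getNumPartitions}")

// With 4 executors * 4 cores, aim for at least ~16 partitions:
val rebalanced = rdd.repartition(16)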

When you run spark-submit twice, there is a good chance that dynamic allocation reaches the maximum number of executors before the second job starts (it takes ~4 seconds by default in your case). Depending on how YARN is configured, this might fill up all of the available resources (either because the configured number of vcores is too small or because memory fills up). If this indeed happens, the second spark-submit will not start processing until some executor is freed, meaning the two jobs take the sum of their times.
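
If you want the two submissions to actually run side by side, one option is to cap each job so it cannot grab the whole cluster. A sketch (shown in code for illustration; the same keys work as --conf flags on spark-submit, and the numbers should be tuned to your cluster):

import org.apache.spark.sql.SparkSession

// Cap each job at roughly half the cluster so two concurrent
// submissions both get executors instead of queueing.
val spark = SparkSession.builder()
  .appName("my-job")
  .config("spark.dynamicAllocation.maxExecutors", "2")
  .config("spark.dynamicAllocation.initialExecutors", "2")
  .getOrCreate()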

BTW, remember that in cluster mode the driver also consumes cluster resources, as it runs in its own YARN container...

Upvotes: 1
