Spark performs tasks with not enough parallelism

I am a beginner in Spark and I am a bit confused about its behaviour.

I am developing an algorithm in Scala. In this method I create an RDD with the number of partitions specified by the user, like this:

val fichero = sc.textFile(file, numPartitions)

I am developing on a cluster with 12 workers and 216 cores available (18 per node). But when I went to the Spark UI to debug the application, I saw the following event timeline for a given stage:

[Image: Spark event timeline of a stage]

Sorry for the quality of the image, but I had to zoom out a lot. In this execution there are 128 partitions. But, as can be seen in the image, the whole RDD is processed by only two of the twelve available executors, so some tasks are executed sequentially, and I don't want that behaviour.

So the question is: what's happening here? Could I use all the workers so that the tasks run in parallel? I have seen the option:

spark.default.parallelism

But this option is overridden when choosing the number of partitions to use. I am launching the application with the default parameters of the spark-submit script.
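For reference, this is a minimal sketch of how that option could be set programmatically when building the context (the app name is hypothetical and the value 216 just mirrors the 12 workers × 18 cores above):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")                      // hypothetical app name, for illustration only
  .set("spark.default.parallelism", "216")  // illustrative: 12 workers * 18 cores
val sc = new SparkContext(conf)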

Upvotes: 3

Views: 3276

Answers (2)

Artem Aliev

Reputation: 1407

numPartitions is a hint, not a requirement. It is ultimately passed to the Hadoop InputFormat (see https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/mapred/FileInputFormat.html#getSplits(org.apache.hadoop.mapred.JobConf, int)). You can always check the actual number of partitions with:

val fichero = sc.textFile(file, numPartitions)
fichero.partitions.size
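
If an exact partition count is needed regardless of how the input is split, a minimal follow-up sketch would be to repartition after loading (at the cost of a shuffle):

// repartition triggers a shuffle but guarantees the requested number of partitions
val repartitioned = fichero.repartition(numPartitions)
repartitioned.partitions.size  // now equals numPartitions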

Upvotes: 1

Raphael Roth

Reputation: 27373

You should set --num-executors to a higher number (the default is 2); you should also look at --executor-cores, which is 1 by default. Try e.g. --num-executors 128.

Make sure that your number of partitions is a multiple (I normally use 2 or 4, depending on the resources needed) of "the number of executors times the number of cores per executor".

See spark-submit --help, and for further reading I recommend having a look at this post (especially the "tuning parallelism" section): http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
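
For concreteness, a hedged example of what such a submit command could look like (the master, class name, jar and numbers are purely illustrative and assume a YARN-like setup; adapt them to the actual cluster):

spark-submit \
  --master yarn \
  --num-executors 12 \
  --executor-cores 18 \
  --class com.example.MyApp \
  my-app.jar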

Upvotes: 2
