Spark performs tasks with not enough parallelism

I am a beginner in Spark and I am a bit confused about its behaviour.

I am developing an algorithm in Scala. In this method I create an RDD with the number of partitions specified by the user, like this:

val fichero = sc.textFile(file, numPartitions)

I am developing on a cluster with 12 workers and 216 cores available (18 per node). But when I went to the Spark UI to debug the application, I saw the following event timeline for a given stage:

[Image: Spark event timeline of a stage]

Sorry for the quality of the image, but I had to zoom out a lot. In this execution there are 128 partitions. But, as can be seen in the image, the whole RDD is processed by only two of the twelve available executors, so some tasks are executed sequentially, and I don't want that behaviour.

So the question is: what's happening here? Could I use all the workers so that the tasks run in parallel? I have seen the option:

spark.default.parallelism

But this option is overridden when choosing the number of partitions to use. I am launching the application with the default parameters of the spark-submit script.
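For reference, this is a minimal sketch of how that option could be set programmatically when building the context (the app name is hypothetical and the value 216 just mirrors the 12 workers × 18 cores above):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")                      // hypothetical app name, for illustration only
  .set("spark.default.parallelism", "216")  // illustrative: 12 workers * 18 cores
val sc = new SparkContext(conf)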

Upvotes: 3

Views: 3276

Answers (2)

Artem Aliev

Reputation: 1407

numPartitions is a hint, not a requirement. It is ultimately passed to the Hadoop InputFormat (see https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/mapred/FileInputFormat.html#getSplits(org.apache.hadoop.mapred.JobConf, int)). You can always check the actual number of partitions with:

val fichero = sc.textFile(file, numPartitions)
fichero.partitions.size
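
If an exact partition count is needed regardless of how the input is split, a minimal follow-up sketch would be to repartition after loading (at the cost of a shuffle):

// repartition triggers a shuffle but guarantees the requested number of partitions
val repartitioned = fichero.repartition(numPartitions)
repartitioned.partitions.size  // now equals numPartitions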

Upvotes: 1

Raphael Roth

Reputation: 27373

You should set --num-executors to a higher number (the default is 2); you should also look at --executor-cores, which is 1 by default. Try e.g. --num-executors 128.

Make sure that your number of partitions is a multiple (I normally use 2 or 4, depending on the resources needed) of "the number of executors times the number of cores per executor".

See spark-submit --help, and for further reading I recommend having a look at this post (especially the "tuning parallelism" section): http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
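
For concreteness, a hedged example of what such a submit command could look like (the master, class name, jar and numbers are purely illustrative and assume a YARN-like setup; adapt them to the actual cluster):

spark-submit \
  --master yarn \
  --num-executors 12 \
  --executor-cores 18 \
  --class com.example.MyApp \
  my-app.jar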

Upvotes: 2
