Reputation: 81

spark.default.parallelism for Parallelize RDD defaults to 2 for spark submit

Spark standalone cluster with a master and 2 worker nodes 4 cpu core on each worker. Total 8 cores for all workers.

When running the following via spark-submit (spark.default.parallelism is not set)

val myRDD = sc.parallelize(1 to 100000)
println("Partititon size - " + myRDD.partitions.size)
val totl = myRDD.reduce((x, y) => x + y)
println("Sum - " + totl)

It returns value 2 for partition size.

When using spark-shell by connecting to spark standalone cluster the same code returns correct partition size 8.

What can be the reason ?

Thanks.

Upvotes: 7

Answers (2)

Suravi Mahanta

Reputation: 21

spark.default.parallelism was introduced with RDD hence this property is only applicable to RDD. Whereas spark.sql.shuffle.partitions was introduced with DataFrame and it only works with DataFrame, the default value for this configuration set to 200.

For DataFrame, wider transformations like groupBy(), join() triggers the data shuffling hence the result of these transformations results in partition size same as the value set in spark.sql.shuffle.partitions. Prior to using these operations, use the below code to get the desired partitions (change the value according to your need) Try this spark.conf.set("spark.sql.shuffle.partitions",100)

Upvotes: 0

Joe Widen

Reputation: 2448

spark.default.parallelism defaults to the number of all cores on all machines. The parallelize api has no parent RDD to determine the number of partitions, so it uses the spark.default.parallelism.

When running spark-submit, you're probably running it locally. Try submitting your spark-submit with the same start up configs as you do the spark-shell.

Pulled this from the documentation:

spark.default.parallelism

For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:

Local mode: number of cores on the local machine

Mesos fine grained mode: 8

Others: total number of cores on all executor nodes or 2, whichever is larger

Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.

Upvotes: 5

spark.default.parallelism for Parallelize RDD defaults to 2 for spark submit

Answers (2)

Related Questions