Reputation: 81
Spark standalone cluster with a master and 2 worker nodes 4 cpu core on each worker. Total 8 cores for all workers.
When running the following via spark-submit (spark.default.parallelism is not set)
val myRDD = sc.parallelize(1 to 100000)
println("Partititon size - " + myRDD.partitions.size)
val totl = myRDD.reduce((x, y) => x + y)
println("Sum - " + totl)
It returns value 2 for partition size.
When using spark-shell by connecting to spark standalone cluster the same code returns correct partition size 8.
What can be the reason ?
Thanks.
Upvotes: 7
Views: 39277
Reputation: 21
spark.default.parallelism was introduced with RDD hence this property is only applicable to RDD. Whereas spark.sql.shuffle.partitions was introduced with DataFrame and it only works with DataFrame, the default value for this configuration set to 200.
For DataFrame, wider transformations like groupBy(), join() triggers the data shuffling hence the result of these transformations results in partition size same as the value set in spark.sql.shuffle.partitions. Prior to using these operations, use the below code to get the desired partitions (change the value according to your need) Try this spark.conf.set("spark.sql.shuffle.partitions",100)
Upvotes: 0
Reputation: 2448
spark.default.parallelism
defaults to the number of all cores on all machines. The parallelize api has no parent RDD to determine the number of partitions, so it uses the spark.default.parallelism
.
When running spark-submit
, you're probably running it locally. Try submitting your spark-submit
with the same start up configs as you do the spark-shell.
Pulled this from the documentation:
spark.default.parallelism
For distributed shuffle operations like reduceByKey
and join
, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:
Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger
Default number of partitions in RDDs returned by transformations like join
, reduceByKey
, and parallelize when not set by user.
Upvotes: 5