Reputation: 22388
Is there a "spark.default.parallelism" equivalent for Spark SQL DataFrames for narrow transformations (map, filter, etc.)?
Apparently, partition control differs between RDDs and DataFrames. DataFrames have spark.sql.shuffle.partitions to control the number of partitions produced by shuffles (wide transformations, if I understand correctly), and "spark.default.parallelism" would have no effect.
How Spark dataframe shuffling can hurt your partitioning
But what does shuffling have to do with partitioning? Well, nothing really if you are working with RDDs…but with dataframes, that’s a different story. ... As you can see the partition number suddenly increases. This is due to the fact that the Spark SQL module contains the following default configuration: spark.sql.shuffle.partitions set to 200.
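For example, here is a minimal sketch of the behavior described above (local mode, illustrative names; I disable adaptive execution so the raw shuffle partition count is visible):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-partitions-demo")
  .master("local[4]")
  .config("spark.sql.adaptive.enabled", "false") // keep the raw shuffle count visible
  .getOrCreate()

import spark.implicits._

val df = (1 to 1000).toDF("id")
println(df.rdd.getNumPartitions)        // small, e.g. 4 on local[4]

// groupBy forces a shuffle, so the result picks up spark.sql.shuffle.partitions
val grouped = df.groupBy($"id" % 10).count()
println(grouped.rdd.getNumPartitions)   // 200, the default for spark.sql.shuffle.partitions
```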
The answer below suggests that spark.default.parallelism does not work for DataFrames.
What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?
The spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set explicitly by the user. But spark.default.parallelism seems to only work for raw RDDs and is ignored when working with DataFrames.
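That matches what I see in a quick sketch like the one below (again local mode, names are illustrative): setting spark.default.parallelism only changes the partition count of RDDs created with parallelize, while the DataFrame shuffle still follows spark.sql.shuffle.partitions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallelism-vs-shuffle-partitions")
  .master("local[4]")
  .config("spark.default.parallelism", "8")
  .config("spark.sql.shuffle.partitions", "50")
  .config("spark.sql.adaptive.enabled", "false") // keep the raw counts visible
  .getOrCreate()

import spark.implicits._

val rdd = spark.sparkContext.parallelize(1 to 1000)
println(rdd.getNumPartitions)             // 8, from spark.default.parallelism

val shuffled = (1 to 1000).toDF("id").groupBy($"id" % 10).count()
println(shuffled.rdd.getNumPartitions)    // 50, from spark.sql.shuffle.partitions
```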
Upvotes: 2
Views: 2440
Reputation: 13548
The narrow transformations (map, filter) preserve the number of partitions, which is why there is no need for a parallelism setting. A setting only makes sense for transformations that may affect the number of partitions.
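A rough sketch to illustrate (local mode, illustrative names; with adaptive query execution enabled the shuffle partition count may be coalesced):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[4]").getOrCreate()
import spark.implicits._

// Start with an explicit number of partitions
val df = spark.range(0, 1000000, 1, 16).toDF("id")
println(df.rdd.getNumPartitions)          // 16

// Narrow transformation: no shuffle, partitioning of the parent is kept
val narrow = df.filter($"id" % 2 === 0)
println(narrow.rdd.getNumPartitions)      // still 16

// Wide transformation: shuffle, so spark.sql.shuffle.partitions applies
val wide = narrow.groupBy($"id" % 10).count()
println(wide.rdd.getNumPartitions)        // 200 by default (unless AQE coalesces it)
```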
Upvotes: 2