mon

Reputation: 22388

RDD spark.default.parallelism equivalent for Spark Dataframe

Question

Is there a "spark.default.parallelism" equivalent for Spark SQL DataFrames that controls the narrow transformations (map, filter, etc.)?

Background

Apparently, partition control is different between RDDs and DataFrames. A DataFrame has spark.sql.shuffle.partitions to control the number of partitions produced by shuffling (wide transformations, if I understand correctly), and "spark.default.parallelism" would have no effect.

How Spark dataframe shuffling can hurt your partitioning

But what does shuffling have to do with partitioning? Well, nothing really if you are working with RDDs…but with dataframes, that’s a different story. ... As you can see the partition number suddenly increases. This is due to the fact that the Spark SQL module contains the following default configuration: spark.sql.shuffle.partitions set to 200.
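A minimal sketch of that behaviour, assuming a local session (the data, app name, and master are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-partitions-demo")
  .master("local[4]")
  .config("spark.sql.shuffle.partitions", "200")   // the default value
  .config("spark.sql.adaptive.enabled", "false")   // keep Spark 3.x AQE from coalescing the shuffle
  .getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")
println(df.rdd.getNumPartitions)        // a small number, driven by the local input

// A wide transformation (groupBy) shuffles, so the result picks up
// spark.sql.shuffle.partitions instead of the input's partition count.
val grouped = df.groupBy("key").count()
println(grouped.rdd.getNumPartitions)   // 200
```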

The post below suggests that spark.default.parallelism would not work for DataFrames.

What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?

The spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set explicitly by the user. But the spark.default.parallelism seems to only be working for raw RDD and is ignored when working with data frames.
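To contrast the two settings, a sketch along the same lines (all configuration values and data are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallelism-vs-shuffle-partitions")
  .master("local[4]")
  .config("spark.default.parallelism", "8")
  .config("spark.sql.shuffle.partitions", "200")
  .config("spark.sql.adaptive.enabled", "false")
  .getOrCreate()
val sc = spark.sparkContext
import spark.implicits._

// RDD API: spark.default.parallelism drives parallelize and reduceByKey.
val rdd = sc.parallelize(1 to 100).map(i => (i % 10, i)).reduceByKey(_ + _)
println(rdd.getNumPartitions)           // 8

// DataFrame API: the equivalent shuffle uses spark.sql.shuffle.partitions instead.
val df = (1 to 100).toDF("n").groupBy($"n" % 10).count()
println(df.rdd.getNumPartitions)        // 200
```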

Upvotes: 2

Views: 2440

Answers (1)

Sim

Reputation: 13548

The narrow transformations (map, filter) preserve the number of partitions, which is why there is no need for a parallelism setting. A setting only makes sense for transformations that may affect the number of partitions.
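This is easy to check, for example (a minimal sketch with made-up partition counts):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[4]").getOrCreate()
import spark.implicits._

val df = spark.range(1000000).toDF("id").repartition(16)
println(df.rdd.getNumPartitions)        // 16

// filter and map-style column operations are narrow: 16 partitions in, 16 out,
// so there is nothing for a parallelism setting to control here.
val narrow = df.filter($"id" % 2 === 0).withColumn("doubled", $"id" * 2)
println(narrow.rdd.getNumPartitions)    // 16
```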

Upvotes: 2
