Reputation: 383
I am new to Spark SQL and have a question about partition usage during joins.
Assume there is a table named test1 that is saved as 10 partition (parquet) files. Also assume that spark.sql.shuffle.partitions = 200.
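For context, a minimal sketch of the setup I mean, in spark-shell (the path, sizes and column names are just placeholders):

// spark is the SparkSession predefined in spark-shell
import spark.implicits._

// Write test1 as 10 parquet partition files (placeholder data and path)
spark.range(0, 1000000L)
  .withColumn("key", $"id" % 100)
  .repartition(10)
  .write.mode("overwrite").parquet("/tmp/test1")

// Number of partitions used whenever a join has to shuffle
spark.conf.set("spark.sql.shuffle.partitions", "200")

val test1 = spark.read.parquet("/tmp/test1")
println(test1.rdd.getNumPartitions)   // typically 10, roughly one per file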
Question:
If test1 is joined to another table, will Spark perform the join using 10 partitions (the number of partitions the table resides in), or will it repartition the table into 200 partitions anyway (as per the shuffle partition value) and then perform the join? In the latter case I would expect the join to yield better performance. If the join is performed using the 10 partitions, isn't it better to always repartition (CLUSTER BY) the joining tables into a higher number of partitions for better join performance? The sketch below shows what I mean by repartitioning before the join.
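Something like this, assuming test1 and test2 are registered tables and id is a placeholder join key:

import org.apache.spark.sql.functions.col

// SQL form: CLUSTER BY redistributes rows on the join key into
// spark.sql.shuffle.partitions partitions (and sorts within each partition)
val left  = spark.sql("SELECT * FROM test1 CLUSTER BY id")
val right = spark.sql("SELECT * FROM test2 CLUSTER BY id")

// Roughly equivalent DataFrame form: repartition on the join key first
val joined = spark.table("test1").repartition(col("id"))
  .join(spark.table("test2").repartition(col("id")), "id")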
In the Spark UI I have seen some stages using only 10 tasks, while other stages use 200 tasks.
Can someone please help me understand this?
Thanks
Upvotes: 0
Views: 1068
Reputation: 3008
Spark will read the data in 10 partitions using 10 tasks, and it will similarly read the partitions of the other DataFrame used in the join. Once it has all the data, it will shuffle it into 200 partitions, which is the default value for shuffle partitions. That is why you see 10 tasks in one stage, some other number of tasks in a different stage, and finally 200 tasks after the shuffle operation. So after the join you end up with 200 partitions by default, unless you have set spark.sql.shuffle.partitions to a different value in the Spark configuration.
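A rough way to see this yourself, assuming both tables are too large to be broadcast and adaptive query execution is not coalescing partitions (the paths and the join column id are placeholders):

val a = spark.read.parquet("/tmp/test1")   // scanned in ~10 tasks, one per file
val b = spark.read.parquet("/tmp/test2")   // scanned in as many tasks as test2 has files

val joined = a.join(b, "id")               // both sides are shuffled for the join
println(joined.rdd.getNumPartitions)       // 200, i.e. spark.sql.shuffle.partitions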
Upvotes: 0
Reputation: 27383
Spark will use 200 partitions in most cases (SortMergeJoin, ShuffleHashJoin), unless Spark estimates your table to be small enough for a BroadcastHashJoin.
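For example, the broadcast decision is driven by the size threshold below (shown with its default value), and you can also force a broadcast with a hint. The DataFrames here are just placeholders:

import org.apache.spark.sql.functions.broadcast

// Tables estimated to be smaller than this (in bytes) are broadcast
// instead of shuffled; 10 MB is the default
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

// Placeholder DataFrames
val largeDf = spark.range(0, 1000000L).toDF("id")
val smallDf = spark.range(0, 100L).toDF("id")

// Force a BroadcastHashJoin explicitly with a hint
val joined = largeDf.join(broadcast(smallDf), "id")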
Upvotes: 0