Reputation: 1
In general,the join operation of Spark will cause shuffle. And when will the operation of join will not cause shuffle? And who can tell me some methods to optimize for Spark?
Upvotes: 0
Views: 468
Reputation: 1
join
will not cause shuffle directly if both data structures (either Dataset
or RDD
) are already co-partitioned. This means that data has been already shuffled with repartition
/ partitionBy
or aggregation and partitioning schemes are compatible (the same partitioning key and number of partitions).
join
will not cause network traffic if both structures are both co-partitioned and co-located. Since co-location happens only if data has been previously shuffled in the same actions this is a bordercase scenario.
Also shuffle doesn't occur when join is expressed as broadcast join.
Upvotes: 0