Future
Future

Reputation: 1

When will the spark operation of join does not cause shuffle

In general,the join operation of Spark will cause shuffle. And when will the operation of join will not cause shuffle? And who can tell me some methods to optimize for Spark?

Upvotes: 0

Views: 468

Answers (1)

user8003557
user8003557

Reputation: 1

  • join will not cause shuffle directly if both data structures (either Dataset or RDD) are already co-partitioned. This means that data has been already shuffled with repartition / partitionBy or aggregation and partitioning schemes are compatible (the same partitioning key and number of partitions).

  • join will not cause network traffic if both structures are both co-partitioned and co-located. Since co-location happens only if data has been previously shuffled in the same actions this is a bordercase scenario.

  • Also shuffle doesn't occur when join is expressed as broadcast join.

Upvotes: 0

Related Questions