Reputation: 1167
This figure is extracted from the Databricks' blog article "Adaptive Query Execution: Speeding Up Spark SQL at Runtime" in the "Dynamically optimizing skew joins" section:
The "Duplicate B0" label is confusing to me and it is not explicited in the article. Is there any replication of the shuffle files (corresponding to the partition B0) involved ?
Upvotes: 1
Views: 240
Reputation: 1167
There is no data replication, the only additional cost is that the B0 partition has to be fetched several times (1 time for each split of the A0 skewed partition), hence increasing the network cost of the job.
Source: The doc associated with the issue "[SPARK-29544] Optimize skewed join at runtime with new Adaptive Execution" that does not mention any replication and only warns with:
This approach will introduce the additional cost by reading N times about the partition 0 of table A. However the benefit of handing the skew join may be more than the cost.
Upvotes: 2