Does AQE physically duplicates data on one side of a join when tackling data skew on the other side?

Question

This figure is extracted from the Databricks' blog article "Adaptive Query Execution: Speeding Up Spark SQL at Runtime" in the "Dynamically optimizing skew joins" section:

The "Duplicate B0" label is confusing to me and it is not explicited in the article. Is there any replication of the shuffle files (corresponding to the partition B0) involved ?

ebonnal · Accepted Answer

There is no data replication, the only additional cost is that the B0 partition has to be fetched several times (1 time for each split of the A0 skewed partition), hence increasing the network cost of the job.

Source: The doc associated with the issue "[SPARK-29544] Optimize skewed join at runtime with new Adaptive Execution" that does not mention any replication and only warns with:

This approach will introduce the additional cost by reading N times about the partition 0 of table A. However the benefit of handing the skew join may be more than the cost.

Does AQE physically duplicates data on one side of a join when tackling data skew on the other side?

Answers (1)

Related Questions