WestCoastProjects

Reputation: 63239

Co-partitioned joins in Spark SQL

Are there any implementations of Spark SQL DataSources that offer co-partitioned joins - most likely via CoGroupedRDD? I have not seen any uses within the existing Spark codebase.

The motivation is to greatly reduce shuffle traffic in the case where two tables are partitioned on the same keys, with the same number of partitions and the same key ranges: there would then be an Mx1 instead of an MxN shuffle fanout.

The only large-scale join implementation presently in Spark SQL appears to be ShuffledHashJoin, which does require the MxN shuffle fanout and is therefore expensive.
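For reference, here is a minimal RDD-level sketch of the behavior I am after (assuming an existing SparkContext sc; the data and partition count are made up):

    import org.apache.spark.HashPartitioner

    // Pre-partition both sides with the same partitioner: the join then
    // becomes a narrow, Mx1 dependency instead of an MxN shuffle.
    val partitioner = new HashPartitioner(16)

    val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(partitioner).cache()
    val right = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(partitioner).cache()

    // Both RDDs share the same Partitioner, so join() co-locates keys
    // without re-shuffling either input.
    val joined = left.join(right)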

Upvotes: 16

Views: 3791

Answers (1)

Michael Armbrust

Reputation: 1565

I think you are looking for the Bucket Join optimization that should be coming in Spark 2.0.
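A rough sketch of what that looks like with the DataFrameWriter.bucketBy API expected in 2.0 - the table and column names below are placeholders, and spark is the 2.0 SparkSession:

    // Write both tables with the same bucket count and bucket key.
    ordersDF.write.bucketBy(32, "customer_id").sortBy("customer_id").saveAsTable("orders_b")
    customersDF.write.bucketBy(32, "customer_id").sortBy("customer_id").saveAsTable("customers_b")

    // A join on the bucket key can then skip the shuffle (Exchange)
    // on both sides of the plan.
    val joined = spark.table("orders_b").join(spark.table("customers_b"), "customer_id")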

In 1.6 you can accomplish something similar, but only by caching the data; see SPARK-4849.
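A sketch of that 1.6 workaround, assuming two DataFrames dfA and dfB joined on a column "id" (names are placeholders):

    import org.apache.spark.sql.functions.col

    // Repartition both sides by the join key and cache; the in-memory
    // relations retain their partitioning, so the join on that key
    // should not need another shuffle.
    val a = dfA.repartition(col("id")).cache()
    val b = dfB.repartition(col("id")).cache()
    a.count(); b.count()          // materialize both caches
    val joined = a.join(b, "id")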

Upvotes: 5
