Rajnil Guha

Reputation: 435

What is the most efficient way to join a very large DataFrame (1000300 rows) with a relatively small DataFrame (6090 rows) in Spark SQL?

In Pig Latin there is a special kind of join for this purpose, called a fragment replicate join, which joins a very large relation with a much smaller one. Is there a way to perform an efficient join between a very large DataFrame and a smaller DataFrame in Spark SQL, similar to the one in Pig Latin?

Upvotes: 0

Views: 52

Answers (1)

T. Gawęda

Reputation: 16086

This is called a broadcast join.

If your DataFrame's size is below spark.sql.autoBroadcastJoinThreshold, Spark will use this type of join automatically. If not, wrap your DataFrame in the broadcast function:

import org.apache.spark.sql.functions._

// df2 is the smaller DataFrame; broadcast() hints Spark to replicate it to every executor
df1.join(broadcast(df2))

A broadcast hash join sends the smaller DataFrame to every node, which makes the join very fast because no shuffle is needed.
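For completeness, here is a minimal sketch of how this fits together, assuming two hypothetical DataFrames orders (large) and regions (small) joined on a shared region_id column; the file paths and the raised threshold value are placeholders, not something from your setup:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("BroadcastJoinExample").getOrCreate()

// Tables smaller than spark.sql.autoBroadcastJoinThreshold (default 10 MB)
// are broadcast automatically; raise it if the small side is slightly larger.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "52428800") // 50 MB

val orders  = spark.read.parquet("/data/orders")   // large DataFrame
val regions = spark.read.parquet("/data/regions")  // small DataFrame

// Force a broadcast hash join: regions is copied to every executor,
// so the join runs locally on each partition of orders with no shuffle.
val joined = orders.join(broadcast(regions), Seq("region_id"))
joined.show(5)

You can verify which join strategy was picked with joined.explain(), which will show BroadcastHashJoin in the physical plan when the broadcast takes effect.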

Upvotes: 1
