Rajnil Guha

Reputation: 435

What is the most efficient way to join a very large DataFrame (1000300 rows) with a relatively small DataFrame (6090 rows) in Spark SQL?

In Pig Latin there is a special kind of join for this purpose, called a fragment replicate join, which joins a very large relation with a much smaller one. Is there a way to perform an efficient join between a very large DataFrame and a smaller DataFrame in Spark SQL, similar to the one in Pig Latin?

Upvotes: 0

Views: 52

Answers (1)

T. Gawęda

Reputation: 16086

This is called a broadcast join.

If your DataFrame's size is below spark.sql.autoBroadcastJoinThreshold, Spark will use this type of join automatically. If not, wrap your DataFrame in the broadcast function:

import org.apache.spark.sql.functions._

// df2 is the smaller DataFrame; broadcast() hints Spark to replicate it to every executor
df1.join(broadcast(df2))

A broadcast hash join sends the smaller DataFrame to every node, which makes the join very fast because no shuffle is needed.
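For completeness, here is a minimal sketch of how this fits together, assuming two hypothetical DataFrames orders (large) and regions (small) joined on a shared region_id column; the file paths and the raised threshold value are placeholders, not something from your setup:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("BroadcastJoinExample").getOrCreate()

// Tables smaller than spark.sql.autoBroadcastJoinThreshold (default 10 MB)
// are broadcast automatically; raise it if the small side is slightly larger.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "52428800") // 50 MB

val orders  = spark.read.parquet("/data/orders")   // large DataFrame
val regions = spark.read.parquet("/data/regions")  // small DataFrame

// Force a broadcast hash join: regions is copied to every executor,
// so the join runs locally on each partition of orders with no shuffle.
val joined = orders.join(broadcast(regions), Seq("region_id"))
joined.show(5)

You can verify which join strategy was picked with joined.explain(), which will show BroadcastHashJoin in the physical plan when the broadcast takes effect.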

Upvotes: 1
