Reputation: 435
In Pig Latin, we have a special kind of join for this purpose, called a fragment replicate join, which efficiently joins a very large relation to a much smaller one. Is there any way to perform a similarly efficient join between a very large DataFrame and a smaller DataFrame in Spark SQL?
Upvotes: 0
Views: 52
Reputation: 16086
This is called a broadcast join.

If your DataFrame's size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default), Spark will automatically use this type of join. If not, wrap your DataFrame in the broadcast function:
import org.apache.spark.sql.functions._

// Hint Spark to replicate the smaller df2 to every executor
df1.join(broadcast(df2))
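As an alternative to the explicit hint, you can adjust the threshold itself; a brief sketch (the 100 MB value is just an illustration):

// Raise the automatic-broadcast threshold to 100 MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)

// Or disable automatic broadcasting entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)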
A Broadcast Hash Join copies the smaller DataFrame to all worker nodes, which makes the join very fast: the large DataFrame never has to be shuffled.
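For a self-contained illustration, here is a minimal sketch assuming two hypothetical DataFrames, a large orders table and a small countries lookup joined on country_code; explain() lets you verify that the physical plan contains BroadcastHashJoin:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("BroadcastJoinExample")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: a large fact table and a small lookup table
val orders = Seq((1, "US", 100.0), (2, "DE", 50.0))
  .toDF("order_id", "country_code", "amount")
val countries = Seq(("US", "United States"), ("DE", "Germany"))
  .toDF("country_code", "country_name")

// countries is copied to every executor; orders is never shuffled
val joined = orders.join(broadcast(countries), "country_code")

// The physical plan should show BroadcastHashJoin
joined.explain()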
Upvotes: 1