Reputation: 1
I set the autoBroadcast 200M ,table a is 20KB , table b is 20KB ,table c is 100G. I found "a left join b on..." is a "broadcast join" and register the result as "TempTable" (TempTable is 30KB),my question is when I do "c left join TempTable on...",I expect that autobroadcast the TempTable to make a broadcast join but it made a sort merge join.I also tried cache the TempTable and broadcast DataFrame of the TempTable,but it doesn't work... How can I broadcast the TempTable to make a broadcast join with sparkSQL? I'm using spark-1.6.1 thanks!
Upvotes: 0
Views: 1302
Reputation: 381
Though it's very difficult to understand without any code snippet what you have tried so far. I am sharing some sample code tried at Spark 2.3 where broadcast function applied on Temp View(as Temp table is deprecated in Spark2). In following scala code, I have forcefully use broadcast hash join.
import org.apache.spark.sql.functions.broadcast
val adf = spark.range(99999999)
val bdf = spark.range(99999999)
adf.createOrReplaceTempView("a")
bdf.createOrReplaceTempView("b")
val bdJoinDF = spark.sql("select /*+ BROADCASTJOIN(b) */a.id, b.id from a join b on a.id = b.id")
val normalJoinDF = spark.sql("select a.id, b.id from a join b on a.id = b.id")
println(normalJoinDF.queryExecution) //== Physical Plan == *(5) SortMergeJoin [id#39512L], [id#39514L], Inner
println(bdJoinDF.queryExecution) //== Physical Plan == *(2) BroadcastHashJoin [id#39611L], [id#39613L], Inner, BuildRight, false
Upvotes: 2