BROADCASTJOIN hint is not working in PySpark SQL

Question

I am trying to provide broadcast hint to table which is smaller in size, but physical plan is still showing me SortMergeJoin.

spark.sql('select /*+ BROADCAST(pratik_test_temp.crosswalk2016) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.serial_id = c.serial_id').explain()

Output :

Note :

Size of tables are in KBs (test data)
Joining column 'serial_id' is not partitioned column
Using glue catalog as metastore (AWS)
Spark Version - Spark 2.4.4
I have tried BROADCASTJOIN and MAPJOIN hint as well
When I am trying to use created_date [partitioned column] instead of serial_id as my joining condition, it is showing me BroadCast Join -

spark.sql('select /*+ BROADCAST(pratik_test_temp.crosswalk2016) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.created_date = c.created_date').explain()

Output -

Why spark behavior is strange with AWS Glue Catalog as my metastore?

notNull · Accepted Answer

In BROADCAST hint we need to pass the alias name of the table (as you have alias kept in your sql statement).

Try with /*+ BROADCAST(c) */* instead of /*+ BROADCAST(pratik_test_temp.crosswalk2016) */ *

spark.sql('select /*+ BROADCAST(c) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.serial_id = c.serial_id').explain()

BROADCASTJOIN hint is not working in PySpark SQL

Answers (1)

Related Questions