Pratik Garg
Pratik Garg

Reputation: 847

BROADCASTJOIN hint is not working in PySpark SQL

I am trying to provide broadcast hint to table which is smaller in size, but physical plan is still showing me SortMergeJoin.

spark.sql('select /*+ BROADCAST(pratik_test_temp.crosswalk2016) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.serial_id = c.serial_id').explain()

Output : enter image description here

Note :

  1. Size of tables are in KBs (test data)
  2. Joining column 'serial_id' is not partitioned column
  3. Using glue catalog as metastore (AWS)
  4. Spark Version - Spark 2.4.4
  5. I have tried BROADCASTJOIN and MAPJOIN hint as well
  6. When I am trying to use created_date [partitioned column] instead of serial_id as my joining condition, it is showing me BroadCast Join -

spark.sql('select /*+ BROADCAST(pratik_test_temp.crosswalk2016) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.created_date = c.created_date').explain()

Output - enter image description here

Why spark behavior is strange with AWS Glue Catalog as my metastore?

Upvotes: 1

Views: 2803

Answers (1)

notNull
notNull

Reputation: 31490

In BROADCAST hint we need to pass the alias name of the table (as you have alias kept in your sql statement).

Try with /*+ BROADCAST(c) */* instead of /*+ BROADCAST(pratik_test_temp.crosswalk2016) */ *


spark.sql('select /*+ BROADCAST(c) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.serial_id = c.serial_id').explain()

Upvotes: 7

Related Questions