Reputation: 1542
can anyone help in to understand the behaviour of the below query. why there is a broadcast join happening as shown in the physical plan but i am not doing any broadcast join in the query.
query:
SELECT count(*) FROM table WHERE date_id in (select max(date_id) from table)
== Physical Plan ==
*(3) HashAggregate(keys=[], functions=[count(1)], output=[count(1)#17L])
+- Exchange SinglePartition
+- *(2) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#20L])
+- *(2) Project
+- *(2) BroadcastHashJoin [date_id#14], [max(date_id)#16], LeftSemi, BuildRight
:- *(2) FileScan parquet table[date_id#14] Batched: true, Format: Parquet, Location: CatalogFileIndex[gs://data/features/smart_subs/pipeline/s..., PartitionCount: 14, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
+- SortAggregate(key=[], functions=[max(date_id#14)], output=[max(date_id)#16])
+- Exchange SinglePartition
+- SortAggregate(key=[], functions=[partial_max(date_id#14)], output=[max#22])
+- *(1) FileScan parquet table[date_id#14] Batched: true, Format: Parquet, Location: CatalogFileIndex[gs:/data/features/smart_subs/pipeline/s..., PartitionCount: 14, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
Upvotes: 0
Views: 172
Reputation: 94
Here in this screenshot you can see that in databricks they have explained using subqueries will be making a Broadcast Nested Loop join.
For further information, you can read this article: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2728434780191932/1483312212640900/6987336228780374/latest.html
Upvotes: 2