Nikesh

Reputation: 49

Spark SQL view and partition column usage

I have a Databricks table (Parquet, not Delta) "TableA" with a partition column "dldate" and ~3000 columns. When I issue select * from TableA where dldate='2022-01-01', the query completes in seconds.

I have a view "view_tableA" which reads from "TableA" and applies some window functions to some of the columns. When I issue select * from view_tableA where dldate='2022-01-01', the query runs forever.

Will the latter query effectively use the table's partition key? If not, is there any optimization I can do to make sure the partition key is used?

Upvotes: 1

Views: 1843

Answers (1)

Kombajn zbożowy

Reputation: 10703

If the partitioning of all window functions is aligned with the table partitioning, the optimizer will be able to push the predicate down to the table level and apply partition pruning.

For example:

SELECT *
  FROM (SELECT *, sum(a) over (partition by dldate) FROM TableA)
 WHERE dldate = '2022-01-01';

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Window [dldate#2932, a#2933, sum(a#2933) ...], [dldate#2932]
   +- Sort [dldate#2932 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(dldate#2932, 200), ...
         +- Project [dldate#2932, a#2933]
            +- FileScan parquet tablea PartitionFilters: [isnotnull(dldate#2932), (dldate#2932 = 2022-01-01)]
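You can inspect the plan for your own view the same way with EXPLAIN and check whether the FileScan node carries a PartitionFilters entry (EXPLAIN FORMATTED is available in Spark 3.x):

```sql
-- Look for "PartitionFilters: [... dldate ...]" on the FileScan node;
-- an empty PartitionFilters list means no partition pruning happened.
EXPLAIN FORMATTED
SELECT * FROM view_tableA WHERE dldate = '2022-01-01';
```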

Compare this with a query containing a window function not partitioned by dldate:

SELECT *
  FROM (SELECT *, sum(a) over (partition by a) FROM TableA)
 WHERE dldate = '2022-01-01';

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Filter (isnotnull(dldate#2968) AND (dldate#2968 = 2022-01-01))  << !!!
   +- Window [dldate#2968, a#2969, sum(a#2969) ...], [a#2969]
      +- Sort [a#2969 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(a#2969, 200), ...
            +- Project [dldate#2968, a#2969]
               +- FileScan parquet tablea PartitionFilters: []  << !!!
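If the view's window partitioning cannot be aligned with dldate, one workaround (a sketch, not the only option) is to apply the partition filter before the window function runs, e.g. by pushing it into a subquery inside the view definition:

```sql
-- Sketch: filter on the partition column *before* the window function,
-- so the FileScan keeps its PartitionFilters and pruning still applies.
SELECT *, sum(a) over (partition by a)
  FROM (SELECT * FROM TableA WHERE dldate = '2022-01-01');
```

Note this is only equivalent when the window function is meant to operate within a single dldate; if it should aggregate across dates, filtering early changes the result.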

Upvotes: 1

Related Questions