Nikesh

Reputation: 49

Spark SQL view and partition column usage

I have a Databricks table (Parquet, not Delta) "TableA" with a partition column "dldate" and ~3000 columns. When I issue select * from TableA where dldate='2022-01-01', the query completes in seconds.

I have a view "view_tableA" which reads from "TableA" and applies some window functions to some of the columns. When I issue select * from view_tableA where dldate='2022-01-01', the query runs forever.

Will the latter query effectively use the table's partition key? If not, is there any optimization I can do to make sure the partition key is used?

Upvotes: 1

Views: 1843

Answers (1)

Kombajn zbożowy

Reputation: 10703

If the partitioning of all window functions is aligned with the table partitioning, the optimizer will be able to push the predicate down to the table level and apply partition pruning.

For example:

SELECT *
  FROM (SELECT *, sum(a) over (partition by dldate) FROM TableA)
 WHERE dldate = '2022-01-01';

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Window [dldate#2932, a#2933, sum(a#2933) ...], [dldate#2932]
   +- Sort [dldate#2932 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(dldate#2932, 200), ...
         +- Project [dldate#2932, a#2933]
            +- FileScan parquet tablea PartitionFilters: [isnotnull(dldate#2932), (dldate#2932 = 2022-01-01)]
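You can inspect the plan for your own view the same way with EXPLAIN and check whether the FileScan node carries a PartitionFilters entry (EXPLAIN FORMATTED is available in Spark 3.x):

```sql
-- Look for "PartitionFilters: [... dldate ...]" on the FileScan node;
-- an empty PartitionFilters list means no partition pruning happened.
EXPLAIN FORMATTED
SELECT * FROM view_tableA WHERE dldate = '2022-01-01';
```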

Compare this with a query containing a window function not partitioned by dldate:

SELECT *
  FROM (SELECT *, sum(a) over (partition by a) FROM TableA)
 WHERE dldate = '2022-01-01';

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Filter (isnotnull(dldate#2968) AND (dldate#2968 = 2022-01-01))  << !!!
   +- Window [dldate#2968, a#2969, sum(a#2969) ...], [a#2969]
      +- Sort [a#2969 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(a#2969, 200), ...
            +- Project [dldate#2968, a#2969]
               +- FileScan parquet tablea PartitionFilters: []  << !!!
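If the view's window partitioning cannot be aligned with dldate, one workaround (a sketch, not the only option) is to apply the partition filter before the window function runs, e.g. by pushing it into a subquery inside the view definition:

```sql
-- Sketch: filter on the partition column *before* the window function,
-- so the FileScan keeps its PartitionFilters and pruning still applies.
SELECT *, sum(a) over (partition by a)
  FROM (SELECT * FROM TableA WHERE dldate = '2022-01-01');
```

Note this is only equivalent when the window function is meant to operate within a single dldate; if it should aggregate across dates, filtering early changes the result.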

Upvotes: 1

Related Questions