Reputation: 73
Hi All: I am trying to understand how the partitioned tables work. I have a sales table of size 12.9MB. I have a date column that is partitioned by day. My assumption is that when I filter the data table using this date column, the amount of data processed by BigQuery will be optimized. However, it doesn’t seem to work that way, and I would like to understand the reason. In the below query, I am filtering sales.date using a subquery. When I try to execute the query as such, it is processing the entire table of 12.9 MB.
However, if I replace the below subquery with the actual date (the same result that we have from the subquery), then the amount of data processed is 4.9 MB.
The subquery alone processes 630 KB of data. If my understanding is right, shouldn’t the below given query process 4.9 MB + 630 KB = ~ 5.6 MB? But, it still processes 12.9 MB. Can someone explain what’s happening here?
SELECT
sales.*,
FROM `my-project.transaction_data.sales_table` sales
WHERE DATE(sales.date) >= DATE_SUB(DATE((select max(temp.date) FROM ` my-project.transaction_data.sales_table ` temp)), INTERVAL 2 YEAR)
ORDER BY sales.customer, sales.date
Upvotes: 1
Views: 202
Reputation: 172944
Can someone explain what’s happening here?
This is expected behavior
In general, partition pruning will reduce query cost when the filters can be evaluated at the outset of the query without requiring any subquery evaluations or data scans
Complex queries that require the evaluation of multiple stages of a query in order to resolve the predicate (such as inner queries or subqueries) will not prune partitions from the query.
see more at Querying partitioned tables
Possible workaround is to use scripting where you will first calculate the actual date and assign it to valiable and then use it in the query, thus eliminating subquery
Upvotes: 1