siva
siva

Reputation: 73

How does partitioning in BigQuery works?

Hi All: I am trying to understand how the partitioned tables work. I have a sales table of size 12.9MB. I have a date column that is partitioned by day. My assumption is that when I filter the data table using this date column, the amount of data processed by BigQuery will be optimized. However, it doesn’t seem to work that way, and I would like to understand the reason. In the below query, I am filtering sales.date using a subquery. When I try to execute the query as such, it is processing the entire table of 12.9 MB.

However, if I replace the below subquery with the actual date (the same result that we have from the subquery), then the amount of data processed is 4.9 MB.

The subquery alone processes 630 KB of data. If my understanding is right, shouldn’t the below given query process 4.9 MB + 630 KB = ~ 5.6 MB? But, it still processes 12.9 MB. Can someone explain what’s happening here?

SELECT 
sales.*,
FROM `my-project.transaction_data.sales_table` sales
WHERE DATE(sales.date) >= DATE_SUB(DATE((select max(temp.date) FROM ` my-project.transaction_data.sales_table ` temp)), INTERVAL 2 YEAR)
ORDER BY sales.customer, sales.date

Upvotes: 1

Views: 202

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172944

Can someone explain what’s happening here?

This is expected behavior

In general, partition pruning will reduce query cost when the filters can be evaluated at the outset of the query without requiring any subquery evaluations or data scans

Complex queries that require the evaluation of multiple stages of a query in order to resolve the predicate (such as inner queries or subqueries) will not prune partitions from the query.

see more at Querying partitioned tables

Possible workaround is to use scripting where you will first calculate the actual date and assign it to valiable and then use it in the query, thus eliminating subquery

Upvotes: 1

Related Questions