Reputation: 11
I have 2 partitioned tables:
Table 1:
|user_id|request_id|
Table 2:
|ip|user_id|request_id|
I want to get for all ips from partition_table2: - users count(from partition_table1) - users requests(from partition_table1) - user requests(from partition_table2) for users(from partition_table1)
Info: Ip is related to request_id from Table 1, because one user_id can have more than one ip.
Issue: When I filter by _PARTITIONTIME in the main query it doesn't propagate to query from WITH when I do LEFT JOIN, but filter by _PARTITIONTIME is propagated when I do INNER JOIN.
Partition pruning doesn't seem to work: https://cloud.google.com/bigquery/docs/querying-partitioned-tables for LEFT JOIN
My Query:
WITH
users_info AS (
SELECT
t2.ip,
t1.user_id,
COUNT(DISTINCT t1.request_id) AS user_requests,
t1._PARTITIONTIME AS date
FROM partitioned_table1 t1
INNER JOIN partition_table2 t2
ON t1.request_id = t2.request_id
AND t1._PARTITIONTIME = t2._PARTITIONTIME
GROUP BY t2.ip, t1.user_id, t1._PARTITIONTIME
)
SELECT
t2.ip,
COUNT(DISTINCT m.user_id) AS users,
COUNT(DISTINCT t2.request_id) AS t2_users_requests,
SUM(m.user_requests) AS t1_users_requests
FROM partition_table2 t2
LEFT JOIN/INNER JOIN users_info m
ON t2.ip=m.ip
AND t2.user_id=m.user_id
AND m.date = t2._PARTITIONTIME
WHERE DATE(t2._PARTITIONTIME) = "2019-05-20"
GROUP BY t2.ip
If I do INNER JOIN this query processes ~4 GB, but with LEFT JOIN it processes ~3 TB
I did something wrong or is this behaviour expected?
I need this query to create a VIEW. Condition(DATE(t2._PARTITIONTIME) = "2019-05-20") from the above query I'll use to filter the VIEW when I'll query it.
Upvotes: 1
Views: 1851
Reputation: 33705
The columns from the right side of a LEFT OUTER JOIN can potentially be NULL, so yes, BigQuery actually needs to execute the join to figure out the results rather than filtering partitions in advance. If you don't want this behavior, use a subquery where you filter on _PARTITIONTIME
prior to the join.
Upvotes: 1