Laura Malaiasi
Laura Malaiasi

Reputation: 11

Bigquery: Filter by _PARTITIONTIME doesn't propagate on LEFT JOIN

I have 2 partitioned tables:

Table 1:


|user_id|request_id|


Table 2:


|ip|user_id|request_id|


I want to get for all ips from partition_table2: - users count(from partition_table1) - users requests(from partition_table1) - user requests(from partition_table2) for users(from partition_table1)

Info: Ip is related to request_id from Table 1, because one user_id can have more than one ip.

Issue: When I filter by _PARTITIONTIME in the main query it doesn't propagate to query from WITH when I do LEFT JOIN, but filter by _PARTITIONTIME is propagated when I do INNER JOIN.

Partition pruning doesn't seem to work: https://cloud.google.com/bigquery/docs/querying-partitioned-tables for LEFT JOIN

My Query:

WITH
  users_info AS (
  SELECT
    t2.ip,
    t1.user_id,
    COUNT(DISTINCT t1.request_id) AS user_requests,
    t1._PARTITIONTIME AS date
  FROM partitioned_table1 t1
  INNER JOIN partition_table2 t2
    ON t1.request_id = t2.request_id
    AND t1._PARTITIONTIME = t2._PARTITIONTIME
  GROUP BY t2.ip, t1.user_id, t1._PARTITIONTIME
  )
SELECT
  t2.ip,
  COUNT(DISTINCT m.user_id) AS users,
  COUNT(DISTINCT t2.request_id) AS t2_users_requests,
  SUM(m.user_requests) AS t1_users_requests
FROM partition_table2 t2
LEFT JOIN/INNER JOIN users_info m
  ON t2.ip=m.ip
  AND t2.user_id=m.user_id
  AND m.date = t2._PARTITIONTIME
WHERE DATE(t2._PARTITIONTIME) = "2019-05-20" 
GROUP BY t2.ip

If I do INNER JOIN this query processes ~4 GB, but with LEFT JOIN it processes ~3 TB

I did something wrong or is this behaviour expected?


EDIT

I need this query to create a VIEW. Condition(DATE(t2._PARTITIONTIME) = "2019-05-20") from the above query I'll use to filter the VIEW when I'll query it.

Upvotes: 1

Views: 1851

Answers (1)

Elliott Brossard
Elliott Brossard

Reputation: 33705

The columns from the right side of a LEFT OUTER JOIN can potentially be NULL, so yes, BigQuery actually needs to execute the join to figure out the results rather than filtering partitions in advance. If you don't want this behavior, use a subquery where you filter on _PARTITIONTIME prior to the join.

Upvotes: 1

Related Questions