user2830451
user2830451

Reputation: 2316

How does on clause work in join in Google BigQuery

I have two queries running in BQ. I expect them having the same results, but it turns out not. I don't understand why it happens.

Query 1:

select a.* from
(
  select *, _PARTITIONTIME as pt from
  `table_A`
  where 
  p > 5 and q != 0
) as a
left join
(select distinct x, pt from `table_b`) as b
on join_clauses

Query 2:

select a.* from
table_A as a
left join
(select distinct x, pt from table_b) as b
on join_clauses
and a.p > 5 and a.q != 0

If I remove the conditions of a.p > 5 and a.q != 0 from both queries, the results are identical.

As far as I know, the conditions under the on statements is evaluated before the join, which should make the two queries equal. Then I don't know why do I see different results here. Is there any special setup in Google's BQ?

Upvotes: 0

Views: 1556

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173003

In your Query#2 - result will contain ALL rows from table_A being left join to second sub-query. So, for each row in table_a output will have as many respective rows as many matches of on join_clauses and a.p > 5 and a.q != 0 exists in second sub-query - BUT at least one (in case if no matches for that row - this is because of LEFT JOIN used)

In Query #1 - first you filter out from table_a all rows which don't match WHERE clause p > 5 and q != 0 and only then you apply on join_clauses so you start with subset of rows in table_a thus the final output has lower rows than for Query #2

Run these two queries which are simplified version of yours to see this

Analogue of Query #2

#standardSQL
WITH table_a AS (
  SELECT 1 id, 1 p, 0 q UNION ALL
  SELECT 1, 6, 1 UNION ALL
  SELECT 2, 7, 2
), table_b AS (
  SELECT 1 id UNION ALL
  SELECT 2
)
SELECT a.*, b.id idb 
FROM table_a a
LEFT JOIN table_b b
ON a.id = b.id
AND a.p > 5 AND a.q != 0  

with result

Row id  p   q   idb  
1   1   1   0   null     
2   1   6   1   1    
3   2   7   2   2      

and analogue of Query #1

#standardSQL
WITH table_a AS (
  SELECT 1 id, 1 p, 0 q UNION ALL
  SELECT 1, 6, 1 UNION ALL
  SELECT 2, 7, 2
), table_b AS (
  SELECT 1 id UNION ALL
  SELECT 2
)
SELECT a.*, b.id idb 
FROM (
  SELECT * FROM table_a 
  WHERE p > 5 AND q != 0
) a
LEFT JOIN table_b b
ON a.id = b.id

with result

Row id  p   q   idb  
1   1   6   1   1    
2   2   7   2   2    

comparing those two - you can easily see the difference in result and hopefully see now why :o)

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269953

This has nothing to do with BigQuery, only with your understanding of LEFT JOIN.

A LEFT JOIN keeps all rows in the first table, regardless of whether the ON clause evaluates to true, false, or NULL.

That means that filters on the first table have no impact in the FROM clause. Well, they do have an impact -- otherwise matching rows from the second table will have NULL values. However, hardly anyone notices or desires that effect.

The rules for outer joins are simple:

  • For LEFT JOIN, filters on the first table go in the WHERE clause.
  • For LEFT JOIN, filters on subsequent tables go in the ON clause (otherwise the outer join becomes an inner join).
  • For RIGHT JOIN . . . replace with LEFT JOIN. (This can usually be done and most people find the query to be easier to follow.)

Upvotes: 0

Related Questions