Reputation: 502
I ran into something odd with left outer joins on multiple keys in BigQuery. If one of the keys is null
on right table it doesn't match and therefore it puts null
in all the values for the right table.
If I've got 2 tables
Table1 Table2
------------ ----------------
k1 |k2 |v3 k1 |k2 | v4
------------ ----------------
foo|boo |hey foo|NULL| you
bar|test|yo bar|test| hi
foo|NULL|hey
and I do the following join
SELECT t1.k1, t1.k2, t1.v3, t2.v4 FROM [Table1] t1
LEFT OUTER JOIN EACH [Table2] t2
ON t1.k1=t2.k1 AND t1.k2=t2.k2
I get a result like this
t1_k1|t1_k2|t1_v3|t2_v4
-----------------------
foo |boo |hey |NULL --No match here so NULL in t2_v4 it's ok
bar |test |hey |hi --It matches here on bar and test
foo |NULL |hey |NULL --It doesn't match on foo NULL.
I would expect that the last row was
foo |NULL |hey |you
Is that the expected behaviour? With this I mean is no matches on NULL
.
Is there any other way of producing the result I want?
Upvotes: 21
Views: 27183
Reputation: 10611
The problem is just with how BigQuery handles null and not a problem with general SQL, so based on the same explanation from Gordon and Geremy this also should work,
SELECT t1.k1, t1.k2, t1.v3, t2.v4 FROM [Table1] t1
LEFT OUTER JOIN EACH [Table2] t2
ON IFNULL(t1.k1,"null")=IFNULL(t2.k1,"null")
AND IFNULL(t1.k2,"null")=IFNULL(t2.k2, "null")
I had a similar problem and fixed it in the above manner.
Upvotes: 0
Reputation: 1
I had the same problem today on Bigquery (standard SQL) and I solved it using CASE WHEN.
SELECT t1.k1, t1.k2, t1.v3, t2.v4 FROM [Table1] t1
LEFT OUTER JOIN EACH [Table2] t2
ON t1.k1=t2.k1
AND CASE WHEN(t1.k2 IS NULL THEN '1' ELSE t1.k2 END) = CASE WHEN(t2.k2 IS NULL THEN '1' ELSE t2.k2 END)
Upvotes: 0
Reputation: 7046
In SQL, NULL
is not equal to anything--even itself.
Gordon's answer has some reasonable suggestions, but note that BigQuery only supports join conditions that are conjunctions of equalities, which rules out use of OR
or IS NULL
.
Can you use a sentinel value other than null? For example, if you replace NULL
with the empty string (or the string "null"
, or something else that doesn't occur elsewhere in your data), the join will work as you expect. You can even do this on the fly by using a subquery, at minimal performance cost.
SELECT t1.k1, t1.k2, t1.v3, t2.v4
FROM
(SELECT IFNULL(k1, "null") k1, IFNULL(k2, "null") k2 FROM [Table1]) t1
LEFT OUTER JOIN EACH
(SELECT IFNULL(v3, "null") v3, IFNULL(v4, "null") v4 FROM [Table2]) t2
ON t1.k1 = t2.k1 AND t1.k2 = t2.k2
You can add an outer select to turn the string "null"
back into a real NULL
.
Obviously this only works if the string "null"
doesn't occur elsewhere in your data.
Upvotes: 23
Reputation: 1269883
This is standard behavior. One method to fix it is to make the join
clause more complicated:
SELECT t1.k1, t1.k2, t1.v3, t2.v4
FROM [Table1] t1 LEFT OUTER JOIN EACH
[Table2] t2
ON (t1.k1 = t2.k1 OR (t1.k1 is null and t2.k1 is null)) AND
(t1.k2 = t2.k2 OR (t1.k2 is null and t2.k2 is null))
This could have a bad impact on performance. Some databases have a null-safe equality operation that can be used, but I don't think BigQuery supports that.
So, the best advice is to fix the data so NULL
is not a valid key value. Otherwise, you can use multiple joins. If you only had one join
key, it would look like
SELECT t1.k1, t1.k2, t1.v3, coalesce(t2.v4, t2null.v4)
FROM [Table1] t1 LEFT OUTER JOIN EACH
[Table2] t2
ON t1.k1 = t2.k1 LEFT OUTER JOIN EACH
[Table2] t2null
ON (t1.k1 is null and t2.k1 is null) ;
However, this is harder to put together for composite join
keys.
Upvotes: 3