pablorc
pablorc

Reputation: 502

Left outer join in BigQuery on multiple keys doesn't if one of them is null

I ran into something odd with left outer joins on multiple keys in BigQuery. If one of the keys is null on right table it doesn't match and therefore it puts null in all the values for the right table.

If I've got 2 tables

   Table1              Table2
------------      ----------------
 k1 |k2  |v3      k1 |k2  | v4
------------      ----------------
 foo|boo |hey     foo|NULL| you
 bar|test|yo      bar|test| hi
 foo|NULL|hey

and I do the following join

SELECT t1.k1, t1.k2, t1.v3, t2.v4 FROM [Table1] t1
LEFT OUTER JOIN EACH [Table2] t2 
ON t1.k1=t2.k1 AND t1.k2=t2.k2

I get a result like this

t1_k1|t1_k2|t1_v3|t2_v4
-----------------------
foo  |boo  |hey  |NULL  --No match here so NULL in t2_v4 it's ok
bar  |test |hey  |hi    --It matches here on bar and test
foo  |NULL |hey  |NULL  --It doesn't match on foo NULL.

I would expect that the last row was

foo  |NULL |hey  |you

Is that the expected behaviour? With this I mean is no matches on NULL.

Is there any other way of producing the result I want?

Upvotes: 21

Views: 27183

Answers (4)

Stranger
Stranger

Reputation: 10611

The problem is just with how BigQuery handles null and not a problem with general SQL, so based on the same explanation from Gordon and Geremy this also should work,

SELECT t1.k1, t1.k2, t1.v3, t2.v4 FROM [Table1] t1
LEFT OUTER JOIN EACH [Table2] t2 
ON IFNULL(t1.k1,"null")=IFNULL(t2.k1,"null") 
   AND IFNULL(t1.k2,"null")=IFNULL(t2.k2, "null")

I had a similar problem and fixed it in the above manner.

Upvotes: 0

I had the same problem today on Bigquery (standard SQL) and I solved it using CASE WHEN.

SELECT t1.k1, t1.k2, t1.v3, t2.v4 FROM [Table1] t1
LEFT OUTER JOIN EACH [Table2] t2 
ON t1.k1=t2.k1 
AND CASE WHEN(t1.k2 IS NULL THEN '1' ELSE t1.k2 END) = CASE WHEN(t2.k2 IS NULL THEN '1' ELSE t2.k2 END)

Upvotes: 0

Jeremy Condit
Jeremy Condit

Reputation: 7046

In SQL, NULL is not equal to anything--even itself.

Gordon's answer has some reasonable suggestions, but note that BigQuery only supports join conditions that are conjunctions of equalities, which rules out use of OR or IS NULL.

Can you use a sentinel value other than null? For example, if you replace NULL with the empty string (or the string "null", or something else that doesn't occur elsewhere in your data), the join will work as you expect. You can even do this on the fly by using a subquery, at minimal performance cost.

SELECT t1.k1, t1.k2, t1.v3, t2.v4
FROM
  (SELECT IFNULL(k1, "null") k1, IFNULL(k2, "null") k2 FROM [Table1]) t1
LEFT OUTER JOIN EACH
  (SELECT IFNULL(v3, "null") v3, IFNULL(v4, "null") v4 FROM [Table2]) t2
ON t1.k1 = t2.k1 AND t1.k2 = t2.k2

You can add an outer select to turn the string "null" back into a real NULL.

Obviously this only works if the string "null" doesn't occur elsewhere in your data.

Upvotes: 23

Gordon Linoff
Gordon Linoff

Reputation: 1269883

This is standard behavior. One method to fix it is to make the join clause more complicated:

SELECT t1.k1, t1.k2, t1.v3, t2.v4
FROM [Table1] t1 LEFT OUTER JOIN EACH 
     [Table2] t2 
     ON (t1.k1 = t2.k1 OR (t1.k1 is null and t2.k1 is null)) AND
        (t1.k2 = t2.k2 OR (t1.k2 is null and t2.k2 is null))

This could have a bad impact on performance. Some databases have a null-safe equality operation that can be used, but I don't think BigQuery supports that.

So, the best advice is to fix the data so NULL is not a valid key value. Otherwise, you can use multiple joins. If you only had one join key, it would look like

SELECT t1.k1, t1.k2, t1.v3, coalesce(t2.v4, t2null.v4)
FROM [Table1] t1 LEFT OUTER JOIN EACH 
     [Table2] t2 
     ON t1.k1 = t2.k1  LEFT OUTER JOIN EACH
     [Table2] t2null
     ON (t1.k1 is null and t2.k1 is null) ;

However, this is harder to put together for composite join keys.

Upvotes: 3

Related Questions