BigQuery: how to do semi left join?

Question

I couldn't come up with a good title for this question. Sorry about that.

I have two tables A and B. Both have timestamps and shares a common ID between them. Here are schemas of both tables:

Table A:
========
a_id int,
common_id int,
ts timestamp
...

Table B:
========
b_id int,
common_id int,
ts timestamp,
temperature int

Table A is more like device data whenever it changes its status. Table B is more IoT data which contains a temperature of a device every minute or so.

What I want to do is to create a Table C from these two tables. Table C would be in essence Table A + its temperature in closest time from table B.

How can I do this purely in BigQuery SQL? The temperature info doesn't need to be precise.

Mikhail Berlyant · Accepted Answer

Below option (for BigQuery Standard SQL) assumes that in addition of temperature from table b you still need all the rest of values from respective row

#standardSQL
SELECT 
  ARRAY_AGG(
    STRUCT(a_id, a.common_id, a.ts, b_id, b.ts AS b_ts, temperature) 
    ORDER BY ABS(TIMESTAMP_DIFF(a.ts, b.ts, SECOND)) 
    LIMIT 1
  )[SAFE_OFFSET(0)].*
FROM `project.dataset.table_a` a 
LEFT JOIN `project.dataset.table_b` b
ON a.common_id = b.common_id 
AND ABS(TIMESTAMP_DIFF(a.ts, b.ts, MINUTE)) < 30
GROUP BY TO_JSON_STRING(a)

I smoke-tested it with below generated dummy data

#standardSQL
WITH `project.dataset.table_a` AS ( 
  SELECT CAST(1000000 * RAND() AS INT64) a_id, common_id, ts
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2018-01-01 00:00:00', '2018-01-01 23:59:59', INTERVAL 45*60 + 27 SECOND)) ts
  CROSS JOIN UNNEST(GENERATE_ARRAY(1, 10)) common_id
), `project.dataset.table_b` AS ( 
  SELECT CAST(1000000 * RAND() AS INT64) b_id, common_id, ts, CAST(60 + 40 * RAND() AS INT64) temperature 
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2018-01-01 00:00:00', '2018-01-01 23:59:59', INTERVAL 1 MINUTE)) ts
  CROSS JOIN UNNEST(GENERATE_ARRAY(1, 10)) common_id
) 
SELECT 
  ARRAY_AGG(
    STRUCT(a_id, a.common_id, a.ts, b_id, b.ts AS b_ts, temperature) 
    ORDER BY ABS(TIMESTAMP_DIFF(a.ts, b.ts, SECOND)) 
    LIMIT 1
  )[SAFE_OFFSET(0)].*
FROM `project.dataset.table_a` a 
LEFT JOIN `project.dataset.table_b` b
ON a.common_id = b.common_id 
AND ABS(TIMESTAMP_DIFF(a.ts, b.ts, MINUTE)) < 30
GROUP BY TO_JSON_STRING(a)

with example of few rows from output:

Row a_id    common_id ts                        b_id    b_ts                    temperature  
1   276623  1         2018-01-01 00:00:00 UTC   166995  2018-01-01 00:00:00 UTC     74   
2   218354  1         2018-01-01 00:45:27 UTC   464901  2018-01-01 00:45:00 UTC     87   
3   265634  1         2018-01-01 01:30:54 UTC   565385  2018-01-01 01:31:00 UTC     87   
4   758075  1         2018-01-01 02:16:21 UTC   55894   2018-01-01 02:16:00 UTC     84   
5   306355  1         2018-01-01 03:01:48 UTC   844429  2018-01-01 03:02:00 UTC     92   
6   348502  1         2018-01-01 03:47:15 UTC   375859  2018-01-01 03:47:00 UTC     90   
7   774920  1         2018-01-01 04:32:42 UTC   438164  2018-01-01 04:33:00 UTC     61

Here - I set table_b to have temperature for each minute for 10 devices during the whole day of '2018-01-01' and in table_a I set status changed each 45 min 27 sec for same 10 devices during same day. a_id and b_id - just random numbers between 0 and 999999

Note: ABS(TIMESTAMP_DIFF(a.ts, b.ts, MINUTE)) < 30 clause in JOIN controls period that you can consider ok to look for closest ts (in case if some IoT entries are absent from table_b

BigQuery: how to do semi left join?

Answers (2)

Related Questions