Reputation: 11629
I couldn't come up with a good title for this question. Sorry about that.
I have two tables A and B. Both have timestamps and shares a common ID between them. Here are schemas of both tables:
Table A:
========
a_id int,
common_id int,
ts timestamp
...
Table B:
========
b_id int,
common_id int,
ts timestamp,
temperature int
Table A is more like device data whenever it changes its status. Table B is more IoT data which contains a temperature of a device every minute or so.
What I want to do is to create a Table C from these two tables. Table C would be in essence Table A + its temperature in closest time from table B.
How can I do this purely in BigQuery SQL? The temperature info doesn't need to be precise.
Upvotes: 0
Views: 1770
Reputation: 173210
Below option (for BigQuery Standard SQL) assumes that in addition of temperature
from table b you still need all the rest of values from respective row
#standardSQL
SELECT
ARRAY_AGG(
STRUCT(a_id, a.common_id, a.ts, b_id, b.ts AS b_ts, temperature)
ORDER BY ABS(TIMESTAMP_DIFF(a.ts, b.ts, SECOND))
LIMIT 1
)[SAFE_OFFSET(0)].*
FROM `project.dataset.table_a` a
LEFT JOIN `project.dataset.table_b` b
ON a.common_id = b.common_id
AND ABS(TIMESTAMP_DIFF(a.ts, b.ts, MINUTE)) < 30
GROUP BY TO_JSON_STRING(a)
I smoke-tested it with below generated dummy data
#standardSQL
WITH `project.dataset.table_a` AS (
SELECT CAST(1000000 * RAND() AS INT64) a_id, common_id, ts
FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2018-01-01 00:00:00', '2018-01-01 23:59:59', INTERVAL 45*60 + 27 SECOND)) ts
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 10)) common_id
), `project.dataset.table_b` AS (
SELECT CAST(1000000 * RAND() AS INT64) b_id, common_id, ts, CAST(60 + 40 * RAND() AS INT64) temperature
FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2018-01-01 00:00:00', '2018-01-01 23:59:59', INTERVAL 1 MINUTE)) ts
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 10)) common_id
)
SELECT
ARRAY_AGG(
STRUCT(a_id, a.common_id, a.ts, b_id, b.ts AS b_ts, temperature)
ORDER BY ABS(TIMESTAMP_DIFF(a.ts, b.ts, SECOND))
LIMIT 1
)[SAFE_OFFSET(0)].*
FROM `project.dataset.table_a` a
LEFT JOIN `project.dataset.table_b` b
ON a.common_id = b.common_id
AND ABS(TIMESTAMP_DIFF(a.ts, b.ts, MINUTE)) < 30
GROUP BY TO_JSON_STRING(a)
with example of few rows from output:
Row a_id common_id ts b_id b_ts temperature
1 276623 1 2018-01-01 00:00:00 UTC 166995 2018-01-01 00:00:00 UTC 74
2 218354 1 2018-01-01 00:45:27 UTC 464901 2018-01-01 00:45:00 UTC 87
3 265634 1 2018-01-01 01:30:54 UTC 565385 2018-01-01 01:31:00 UTC 87
4 758075 1 2018-01-01 02:16:21 UTC 55894 2018-01-01 02:16:00 UTC 84
5 306355 1 2018-01-01 03:01:48 UTC 844429 2018-01-01 03:02:00 UTC 92
6 348502 1 2018-01-01 03:47:15 UTC 375859 2018-01-01 03:47:00 UTC 90
7 774920 1 2018-01-01 04:32:42 UTC 438164 2018-01-01 04:33:00 UTC 61
Here - I set table_b to have temperature for each minute for 10 devices during the whole day of '2018-01-01' and in table_a I set status changed each 45 min 27 sec for same 10 devices during same day. a_id and b_id - just random numbers between 0 and 999999
Note: ABS(TIMESTAMP_DIFF(a.ts, b.ts, MINUTE)) < 30
clause in JOIN
controls period that you can consider ok to look for closest ts (in case if some IoT entries are absent from table_b
Upvotes: 2
Reputation: 59375
Measuring the closest time by TIMESTAMP_DIFF(a.ts,b.ts, SECOND)
- by its absolute value to get the closest in any direction:
WITH a AS (
SELECT 1 id, TIMESTAMP('2018-01-01 11:01:00') ts
UNION ALL SELECT 1, ('2018-01-02 10:00:00')
UNION ALL SELECT 2, ('2018-01-02 10:00:00')
)
, b AS (
SELECT 1 id, TIMESTAMP('2018-01-01 12:01:00') ts, 43 temp
UNION ALL SELECT 1, TIMESTAMP('2018-01-01 12:06:00'), 47
)
SELECT *,
(SELECT temp
FROM b
WHERE a.id=b.id
ORDER BY ABS(TIMESTAMP_DIFF(a.ts,b.ts, SECOND))
LIMIT 1) temp
FROM a
Upvotes: 1