Reputation: 285
I'm trying to combine the data from two flat, related BigQuery views into single nested table architecture, using standard SQL
I have two similar tables:
Analytics Data: One row for every minute in global timespan
-------------------------------------------------------------------
minute_index | users | users_new | ...
-------------------------------------------------------------------
1312017 | 8 | 3 | ...
1312018 | 9 | 2 | ...
1312019 | 5 | 1 | ...
1312020 | 3 | 0 | ...
1312021 | 5 | 2 | ...
1312023 | 4 | 3 | ...
1312024 | 7 | 4 | ...
1312025 | 6 | 3 | ...
1312026 | 9 | 4 | ...
Event Data: One row for each external event that occurred
----------------------------------------
minute_index | event |
----------------------------------------
1312019 | "TV Spot Broadcast" |
1312023 | "Radio Spot Broadcast" |
1312026 | "Radio Spot Broadcast" |
I'm trying to join them together into one table, where each row in the new table contains a subset of the Analytics table that spans that and some number of the following minutes (let's call it 5):
-----------------------------------------------------------------------------
minute_index | event | window_treated |
-----------------------------------------------------------------------------
1312019 | "TV Spot Broadcast" | minute_index | users | users_new |
|------------------------------------
| 1312019 | 5 | 1 |
| 1312020 | 3 | 0 |
| 1312021 | 5 | 2 |
| 1312023 | 4 | 3 |
| 1312024 | 7 | 4 |
-----------------------------------------------------------------------------
1312023 | "Radio Spot Broadcast" | minute_index | users | users_new |
|------------------------------------
| 1312023 | 4 | 3 |
| 1312020 | 3 | 0 |
| 1312021 | 5 | 2 |
| 1312023 | 4 | 3 |
| 1312024 | 7 | 4 |
I actually have been able to construct nested tables like this, but only by building up and joining convoluted sets of intermediate tables that are clearly way more complicated than they should be, if I can only figure out how to do this sort of thing in a single query.
This is just one example of the various approaches I've tried...
SELECT
ed.timestamp AS timestamp,
ed.minute_index AS minute_index,
(SELECT AS STRUCT
ad.minute_index, ad.users, ad.users_new
FROM `my_project.my_dataset.analytics_data` ad
WHERE (ad.minute_index >= ed.minute_index)
AND (ad.minute_index < (ed.minute_index + 5))
ORDER BY
ed.minute_index) AS units_treated
FROM
`my_project.my_dataset.event_data` ed
but it's also one of several that seemed close, but all lead to the same validator error:
Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
Upvotes: 0
Views: 287
Reputation: 172994
Below is for BigQuery Standard SQL
#standardSQL
SELECT
ed.minute_index,
event,
ARRAY_AGG(ad) window_treated
FROM `my_project.my_dataset.event_data` ed
JOIN `my_project.my_dataset.analytics_data` ad
ON ad.minute_index BETWEEN ed.minute_index AND ed.minute_index + 5
GROUP BY ed.minute_index, event
If to apply to sample data from your question - result will be as below
As you can see - I literally followed the suggestion in error message Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
and transformed correlated subquery
into JOIN
Upvotes: 1