Laurent Stanevich
Laurent Stanevich

Reputation: 285

BigQuery SQL: Embedding subqueries from View A as nested tables in View B

I'm trying to combine the data from two flat, related BigQuery views into single nested table architecture, using standard SQL

I have two similar tables:

Analytics Data: One row for every minute in global timespan

-------------------------------------------------------------------
minute_index | users | users_new | ...
-------------------------------------------------------------------
     1312017 |     8 |         3 | ...
     1312018 |     9 |         2 | ...
     1312019 |     5 |         1 | ...
     1312020 |     3 |         0 | ...
     1312021 |     5 |         2 | ...
     1312023 |     4 |         3 | ...
     1312024 |     7 |         4 | ...
     1312025 |     6 |         3 | ...
     1312026 |     9 |         4 | ...

Event Data: One row for each external event that occurred

----------------------------------------
minute_index | event                   |
----------------------------------------
     1312019 | "TV Spot Broadcast"     |
     1312023 | "Radio Spot Broadcast"  |
     1312026 | "Radio Spot Broadcast"  |

I'm trying to join them together into one table, where each row in the new table contains a subset of the Analytics table that spans that and some number of the following minutes (let's call it 5):

-----------------------------------------------------------------------------
minute_index | event                    | window_treated                   |
-----------------------------------------------------------------------------
     1312019 | "TV Spot Broadcast"      | minute_index | users | users_new |
                                        |------------------------------------
                                        |      1312019 |     5 |         1 |
                                        |      1312020 |     3 |         0 |
                                        |      1312021 |     5 |         2 |
                                        |      1312023 |     4 |         3 |
                                        |      1312024 |     7 |         4 |
-----------------------------------------------------------------------------
     1312023 | "Radio Spot Broadcast"   | minute_index | users | users_new |
                                        |------------------------------------
                                        |      1312023 |     4 |         3 |
                                        |      1312020 |     3 |         0 |
                                        |      1312021 |     5 |         2 |
                                        |      1312023 |     4 |         3 |
                                        |      1312024 |     7 |         4 |

I actually have been able to construct nested tables like this, but only by building up and joining convoluted sets of intermediate tables that are clearly way more complicated than they should be, if I can only figure out how to do this sort of thing in a single query.

This is just one example of the various approaches I've tried...

SELECT
  ed.timestamp AS timestamp,
  ed.minute_index AS minute_index,
  (SELECT AS STRUCT 
     ad.minute_index, ad.users, ad.users_new
   FROM `my_project.my_dataset.analytics_data` ad 
   WHERE (ad.minute_index >= ed.minute_index) 
      AND (ad.minute_index < (ed.minute_index + 5))
   ORDER BY
      ed.minute_index) AS units_treated
FROM
  `my_project.my_dataset.event_data` ed

but it's also one of several that seemed close, but all lead to the same validator error:

Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.

Upvotes: 0

Views: 287

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172994

Below is for BigQuery Standard SQL

#standardSQL
SELECT 
  ed.minute_index, 
  event, 
  ARRAY_AGG(ad) window_treated
FROM `my_project.my_dataset.event_data` ed
JOIN `my_project.my_dataset.analytics_data` ad
ON ad.minute_index BETWEEN ed.minute_index AND ed.minute_index + 5
GROUP BY ed.minute_index, event   

If to apply to sample data from your question - result will be as below

enter image description here

As you can see - I literally followed the suggestion in error message Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN. and transformed correlated subquery into JOIN

Upvotes: 1

Related Questions