Reputation: 833
I have followed the answers given to this question: BigQuery: flatten two repeated columns but it doesn't quite work although it's the closest to what I'm looking for.
I have data being sent from Google Analytics to Google BigQuery from an app. I have 10 repeated columns:
event_params [RECORD REPEATED]
user_properties [RECORD REPEATED]
user_ltv [RECORD NULLABLE]
device [RECORD NULLABLE]
geo [RECORD NULLABLE]
app_info [RECORD NULLABLE]
traffic_source [RECORD NULLABLE]
event_dimensions [RECORD NULLABLE]
ecommerce [RECORD NULLABLE]
items [RECORD REPEATED]
Whenever there is a new event, there will be an:
event_date
event_timestamp
event_name
for each row. These repeated properties can have different lengths for each event and there is a correspondence on the index of the event.
Below is a snapshot of just the first two repeated columns event_params
and user_properties
along with what I would want to generate with these two columns and the others if needed:
Here we see that event_params
has length 7 and user_properties
has length 4. When I run the following code:
-- standardSQL
SELECT
event_name, event_params,
user_properties[OFFSET(off)] AS user_properties
FROM
`yepic-2021.analytics_264796885.events_intraday_*`,
UNNEST(event_params) AS event_params WITH OFFSET off
ORDER BY
event_timestamp DESC
LIMIT 50
but this results in the error:
Array index 4 is out of bounds (overflow)
This makes sense because they are not the same length. So my thought was that if anyone knows how to add null
to all the other columns until their length is equal to which ever column is has the longest length then this would produce the fully flattened output that I want.
This is an example of not what I want where there's an explosion of duplicates by flattening on the already flattened table:
-- standardSQL
SELECT
event_name, event_params, user_properties
FROM
`yepic-2021.analytics_264796885.events_intraday_*`,
UNNEST(event_params) AS event_params,
UNNEST(user_properties) AS user_properties
ORDER BY
event_timestamp DESC
LIMIT 50
Results:
If anyone could help with this approach or knows BigQuery better than myself and a simple method to flatten the data from GA then I would really appreciate your help.
Here is the code that I have tried in BigQuery:
-- standardSQL
WITH data1 AS (
SELECT GENERATE_UUID() AS row_id, event_params, user_properties
FROM `yepic-2021.analytics_264796885.events_intraday_*`
),
data2 AS (
SELECT *, GENERATE_ARRAY(1, GREATEST(ARRAY_LENGTH(event_params), ARRAY_LENGTH(user_properties))) ordinals
FROM data1
)
SELECT row_id, event_params[SAFE_ORDINAL(o)] event_params, user_properties[SAFE_ORDINAL(o)] user_properties
FROM data2, UNNEST(ordinals) o
Results:
Upvotes: 0
Views: 821
Reputation: 7744
Your first approach almost works, except two issues:
SAFE_OFFSET
to avoid errorsevent_params
has more elements than user_properties
, otherwise you miss some user properties.Let's fix it. To allow for either column to be bigger, we'll take largest of the array length, and use it for array index generation. Then we use SAFE_ORDINAL
(SAFE_OFFSET
works too):
-- sample input data, id and two two repeated fields x and y.
with data as (
select 1 id, [1,2,3] x, ['a', 'b'] y
union all
select 2 id, [4,5] x, ['c', 'd', 'e', 'f'] y
),
-- let's add the ordinals arrays, taking length of longer array.
data2 as (
select *,
generate_array(1, greatest(array_length(x), array_length(y))) ordinals
from data
)
select id, x[safe_ordinal(o)] x, y[safe_ordinal(o)] y
from data2, unnest(ordinals) o
The result
id x y
-------------
1 1 a
1 2 b
1 3 null
2 4 c
2 5 d
2 null e
2 null f
Update
This works if the repeated fields are simple. If the repeated field is a RECORD, you get a NULL RECORD. You might want to have a RECORD with NULL leaf values instead, for this use an expression like
coalesce(x[safe_ordinal(o)], struct<a int64, b int64>(null, null)) x
Or to really fully flatten the output table, and only get leaf fields, without any traces of RECORD field, just extract the leaf fields as another step:
-- sample input data, id and two two repeated fields x and y.
with data as (
select 1 id, [struct<a int64, b int64>(1, 2), (3, 4), (5, 6)] x, ['a', 'b'] y
union all
select 2 id, [struct<a int64, b int64>(7, 8), (9, 10)] x, ['c', 'd', 'e', 'f'] y
),
data2 as (
select id, x[safe_ordinal(o)] x, y[safe_ordinal(o)] y
from data, unnest( generate_array(1, greatest(array_length(x), array_length(y)))) o
)
select id, x.a a, x.b b, y
from data2
Last note
So we have a solution, but I have to warn that it is really rare that kind of flattening makes sense. In your case, there is no connection between i-th item in user_properties
and i-th item in event_params
. Here the first user_properties
is paired to the first event_params
, but it is equally related to any of them. They are just two independent lists, and flattening them this way is pretty arbitrary.
Upvotes: 1