MRT
MRT

Reputation: 833

BigQuery: flatten multiple repeated columns of different length

I have followed the answers given to this question: BigQuery: flatten two repeated columns but it doesn't quite work although it's the closest to what I'm looking for.


I have data being sent from Google Analytics to Google BigQuery from an app. I have 10 repeated columns:

Whenever there is a new event, there will be an:

for each row. These repeated properties can have different lengths for each event and there is a correspondence on the index of the event.

Below is a snapshot of just the first two repeated columns event_params and user_properties along with what I would want to generate with these two columns and the others if needed:

Image of BigQuery default table

Here we see that event_params has length 7 and user_properties has length 4. When I run the following code:

-- standardSQL
SELECT
    event_name, event_params,
    user_properties[OFFSET(off)] AS user_properties
FROM
    `yepic-2021.analytics_264796885.events_intraday_*`,
    UNNEST(event_params) AS event_params WITH OFFSET off
ORDER BY
    event_timestamp DESC
LIMIT 50

but this results in the error:

Array index 4 is out of bounds (overflow)

This makes sense because they are not the same length. So my thought was that if anyone knows how to add null to all the other columns until their length is equal to which ever column is has the longest length then this would produce the fully flattened output that I want.

This is an example of not what I want where there's an explosion of duplicates by flattening on the already flattened table:

-- standardSQL
SELECT
    event_name, event_params, user_properties
FROM
    `yepic-2021.analytics_264796885.events_intraday_*`,
    UNNEST(event_params) AS event_params,
    UNNEST(user_properties) AS user_properties
ORDER BY
    event_timestamp DESC
LIMIT 50

Results:

Image of flatten flatten table


If anyone could help with this approach or knows BigQuery better than myself and a simple method to flatten the data from GA then I would really appreciate your help.


TEMP EDITS:

Here is the code that I have tried in BigQuery:

-- standardSQL
WITH data1 AS (
    SELECT GENERATE_UUID() AS row_id, event_params, user_properties
    FROM `yepic-2021.analytics_264796885.events_intraday_*`
),
data2 AS (
    SELECT *, GENERATE_ARRAY(1, GREATEST(ARRAY_LENGTH(event_params), ARRAY_LENGTH(user_properties))) ordinals
    FROM data1
)
SELECT row_id, event_params[SAFE_ORDINAL(o)] event_params, user_properties[SAFE_ORDINAL(o)] user_properties
FROM data2, UNNEST(ordinals) o

Results:

BigQuery Answer Results

Upvotes: 0

Views: 821

Answers (1)

Michael Entin
Michael Entin

Reputation: 7744

Your first approach almost works, except two issues:

  • you need to use SAFE_OFFSET to avoid errors
  • it only works if event_params has more elements than user_properties, otherwise you miss some user properties.

Let's fix it. To allow for either column to be bigger, we'll take largest of the array length, and use it for array index generation. Then we use SAFE_ORDINAL (SAFE_OFFSET works too):

-- sample input data, id and two two repeated fields x and y.
with data as (
    select 1 id, [1,2,3] x, ['a', 'b'] y 
    union all
    select 2 id, [4,5] x, ['c', 'd', 'e', 'f'] y
), 
-- let's add the ordinals arrays, taking length of longer array.
data2 as (
    select *, 
      generate_array(1, greatest(array_length(x), array_length(y))) ordinals
    from data
)
select id, x[safe_ordinal(o)] x, y[safe_ordinal(o)] y
from data2, unnest(ordinals) o

The result

id  x       y
-------------
1   1       a
1   2       b
1   3       null
2   4       c
2   5       d
2   null    e
2   null    f

Update

This works if the repeated fields are simple. If the repeated field is a RECORD, you get a NULL RECORD. You might want to have a RECORD with NULL leaf values instead, for this use an expression like

coalesce(x[safe_ordinal(o)], struct<a int64, b int64>(null, null)) x

Or to really fully flatten the output table, and only get leaf fields, without any traces of RECORD field, just extract the leaf fields as another step:

-- sample input data, id and two two repeated fields x and y.
with data as (
    select 1 id, [struct<a int64, b int64>(1, 2), (3, 4), (5, 6)] x, ['a', 'b'] y 
    union all
    select 2 id, [struct<a int64, b int64>(7, 8), (9, 10)] x, ['c', 'd', 'e', 'f'] y
), 
data2 as (
  select id, x[safe_ordinal(o)] x, y[safe_ordinal(o)] y
  from data, unnest( generate_array(1, greatest(array_length(x), array_length(y)))) o
)
select id, x.a a, x.b b, y 
from data2

Last note

So we have a solution, but I have to warn that it is really rare that kind of flattening makes sense. In your case, there is no connection between i-th item in user_properties and i-th item in event_params. Here the first user_properties is paired to the first event_params, but it is equally related to any of them. They are just two independent lists, and flattening them this way is pretty arbitrary.

Upvotes: 1

Related Questions