Deviling Master

Reputation: 3113

BigQuery: user metadata described by start date and end date - Create permutations across multiple tables

I have some BigQuery tables that contain some metadata related to a user, together with the time interval during which that value is valid.

For example:

  1. This table traces how the user's color changes over time:

[image: sample color table (see the input_1 CTE below)]

  2. This table traces how the user's fruit changes over time:

[image: sample fruit table (see the input_2 CTE below)]

First, note that the time intervals in the two tables are not necessarily equal: depending on the nature of each piece of metadata, they can partially overlap, overlap entirely, or not overlap at all.

The goal is to merge the metadata columns of the two tables into a single table that recreates the same start/end date structure, taking into account all the values and the date of each change.

So far I have done this:


WITH input_1 AS (
    SELECT DATE('2021-05-01') AS date_start, DATE('2021-05-05') AS date_end, "user-1" AS user, "white" AS color
    UNION ALL
    SELECT DATE('2021-05-06') AS date_start, DATE('2021-05-10') AS date_end, "user-1" AS user, "blue" AS color
    UNION ALL
    SELECT DATE('2021-05-06') AS date_start, DATE('2021-05-10') AS date_end, "user-2" AS user, "red" AS color
),

input_2 AS (
    SELECT DATE('2021-05-03') AS date_start, DATE('2021-05-07') AS date_end, "user-1" AS user, "apple" AS fruit
    UNION ALL
    SELECT DATE('2021-05-08') AS date_start, DATE('2021-05-11') AS date_end, "user-1" AS user, "cherry" AS fruit
    UNION ALL
    SELECT DATE('2021-05-03') AS date_start, DATE('2021-05-11') AS date_end, "user-2" AS user, "banana" AS fruit
),

-------------------

-- explode each validity interval into one row per day
input_1_day_by_day AS (
  SELECT day, input_1.* EXCEPT(date_start, date_end)
  FROM input_1  
    CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(date_start, date_end)) AS day
),

-- same day-by-day expansion for the fruit table
input_2_day_by_day AS (
  SELECT day, input_2.* EXCEPT(date_start, date_end)
  FROM input_2
    CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(date_start, date_end)) AS day
)

--------------------

SELECT
  -- User
  COALESCE(i1.user,i2.user) AS user,

  -- Date interval
  MIN(COALESCE(i1.day, i2.day)) AS date_start,
  MAX(COALESCE(i1.day, i2.day)) AS date_end,
  
  -- Data
  i1.color,
  i2.fruit

FROM input_1_day_by_day AS i1
  FULL JOIN input_2_day_by_day AS i2 ON i1.user = i2.user AND i1.day = i2.day
  
GROUP BY 1,4,5

ORDER BY 1,2

Basically, I explode each validity interval into one row per day, full-join the two day-level tables on user and day, and then group the rows back into start/end intervals.

The result is this one:

[image: result of the query above]

Now, while the solution seems to work, it is only practical when the tables are small samples like these: two tables, with one metadata column each.

My actual scenario involves many more tables, and each table has multiple related columns. Say, for example, 10 tables with about 20 columns each.

My question is:

In order to create the MIN/MAX date intervals, I'm currently grouping by every metadata field I have:

GROUP BY 1,4,5

If I had the number of tables/columns mentioned above, the resulting grouping would look like

GROUP BY 1,4,5,6,7,8,9,......,30,31,32,......40,41,42,...

Is there a more intelligent way to achieve this kind of result without grouping by each and every column?

Something like a GROUP BY ALL EXCEPT 2,3 notation, or a different kind of grouping that could help in this scenario?
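To show the direction I have in mind, here is a sketch of the day-by-day CTEs and the final SELECT from my query above, rewritten so that the GROUP BY only references the user plus one derived key per input table (building that key with TO_JSON_STRING is just an idea of mine, not something I'm sure is the right approach):

input_1_day_by_day AS (
  SELECT day, t.* EXCEPT(date_start, date_end),
    -- one string key summarizing every metadata column of input_1
    TO_JSON_STRING((SELECT AS STRUCT * EXCEPT(user, date_start, date_end) FROM UNNEST([t]))) AS key_1
  FROM input_1 AS t
    CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(date_start, date_end)) AS day
),

input_2_day_by_day AS (
  SELECT day, t.* EXCEPT(date_start, date_end),
    -- one string key summarizing every metadata column of input_2
    TO_JSON_STRING((SELECT AS STRUCT * EXCEPT(user, date_start, date_end) FROM UNNEST([t]))) AS key_2
  FROM input_2 AS t
    CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(date_start, date_end)) AS day
)

SELECT
  COALESCE(i1.user, i2.user) AS user,
  MIN(COALESCE(i1.day, i2.day)) AS date_start,
  MAX(COALESCE(i1.day, i2.day)) AS date_end,
  ANY_VALUE(i1.color) AS color,
  ANY_VALUE(i2.fruit) AS fruit
FROM input_1_day_by_day AS i1
  FULL JOIN input_2_day_by_day AS i2 ON i1.user = i2.user AND i1.day = i2.day
-- only the user and the per-table keys, no matter how many metadata columns exist
GROUP BY 1, i1.key_1, i2.key_2
ORDER BY 1, 2

With this, the GROUP BY no longer enumerates every metadata column, but the columns still have to be listed once in the SELECT (wrapped in ANY_VALUE), so I'm not sure this is really the "more intelligent way" I'm looking for.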

Here is another example input, with some additional columns:

input_1 AS (
    SELECT DATE('2021-05-01') AS date_start, DATE('2021-05-05') AS date_end, "user-1" AS user, "white" AS color, "new-york" as city, "america" as country
    UNION ALL
    SELECT DATE('2021-05-06') AS date_start, DATE('2021-05-10') AS date_end, "user-1" AS user, "blue" AS color, "paris" as city, "france" as country
),

input_2 AS (
    SELECT DATE('2021-05-03') AS date_start, DATE('2021-05-07') AS date_end, "user-1" AS user, "apple" AS fruit, "dog" as animal, "daisy" AS flower, "iron" AS metal
    UNION ALL
    SELECT DATE('2021-05-08') AS date_start, DATE('2021-05-09') AS date_end, "user-1" AS user, "cherry" AS fruit, "dog" as animal, "rose" as flower, "steel" as metal
    UNION ALL
    SELECT DATE('2021-05-10') AS date_start, DATE('2021-05-11') AS date_end, "user-1" AS user, "cherry" AS fruit, "dog" as animal, "rose" as flower, "iron" as metal
),

Upvotes: 1

Views: 66

Answers (1)

Mikhail Berlyant

Reputation: 173190

Is there a more intelligent way to achieve this kind of result without grouping by each and every column?

Consider the generic solution below:

-- collapse each contiguous run of identical metadata back into a single start/end row
select user, 
  min(day) date_start, 
  max(day) date_end, 
  any_value(t).* except(user, day, grp)
from (
  -- number the contiguous runs ("islands") of days that share the same metadata
  select * except(flag),
    countif(ifnull(flag, true)) over(partition by user order by day) grp
  from (
    -- flag the days on which the combined metadata differs from the previous day
    select * except(mask1, mask2),
      lag(ifnull(mask1, 'null') || ifnull(mask2, 'null')) over(partition by user order by day) != ifnull(mask1, 'null') || ifnull(mask2, 'null') as flag
    from (
      -- expand input_1 day by day and serialize its metadata columns into one string (mask1)
      select user, day, t.* except(user, date_start, date_end),
        to_json_string((select as struct * except(user, date_start, date_end) from unnest([t]))) mask1
      from input_1 t, 
      unnest(generate_date_array(date_start, date_end)) day
    ) 
    full outer join (
      -- same day-by-day expansion and serialization for input_2 (mask2)
      select user, day, t.* except(user, date_start, date_end),
        to_json_string((select as struct * except(user, date_start, date_end) from unnest([t]))) mask2
      from input_2 t, 
      unnest(generate_date_array(date_start, date_end)) day
    ) 
    using(user, day)
  )
) t
group by t.user, t.grp
# order by t.user, date_start

If applied to the sample data in your question, the output is:

[image: query output]
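For more than two input tables the same pattern should extend mechanically: each additional table contributes one more full outer join and one more mask in the lag comparison. Below is a sketch of the three-table case, assuming a hypothetical input_3 CTE shaped like the first two:

select user, 
  min(day) date_start, 
  max(day) date_end, 
  any_value(t).* except(user, day, grp)
from (
  select * except(flag),
    countif(ifnull(flag, true)) over(partition by user order by day) grp
  from (
    select * except(mask1, mask2, mask3),
      lag(ifnull(mask1, 'null') || ifnull(mask2, 'null') || ifnull(mask3, 'null')) over(partition by user order by day) 
        != ifnull(mask1, 'null') || ifnull(mask2, 'null') || ifnull(mask3, 'null') as flag
    from (
      select user, day, t.* except(user, date_start, date_end),
        to_json_string((select as struct * except(user, date_start, date_end) from unnest([t]))) mask1
      from input_1 t, unnest(generate_date_array(date_start, date_end)) day
    ) 
    full outer join (
      select user, day, t.* except(user, date_start, date_end),
        to_json_string((select as struct * except(user, date_start, date_end) from unnest([t]))) mask2
      from input_2 t, unnest(generate_date_array(date_start, date_end)) day
    ) 
    using(user, day)
    full outer join (
      -- hypothetical third table, expanded and serialized exactly like the first two
      select user, day, t.* except(user, date_start, date_end),
        to_json_string((select as struct * except(user, date_start, date_end) from unnest([t]))) mask3
      from input_3 t, unnest(generate_date_array(date_start, date_end)) day
    ) 
    using(user, day)
  )
) t
group by t.user, t.grp

With 10 inputs, the per-table subquery and its maskN simply repeat once per table; nothing in the query ever enumerates the individual metadata columns.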

Upvotes: 1
