Reputation: 3113
I have some BigQuery tables that contain some metadata related to a user, together with the time interval during which that value is valid.
For example:
First, note that the time intervals in the 2 tables are not equal: they can partially overlap, overlap entirely, or not overlap at all, depending on the nature of each metadata field.
The goal here is to merge the metadata columns of the 2 tables into a single table that recreates the same start/end date structure, while considering all the values and the date of each change.
So far I did this:
WITH input_1 AS (
SELECT DATE('2021-05-01') AS date_start, DATE('2021-05-05') AS date_end, "user-1" AS user, "white" AS color
UNION ALL
SELECT DATE('2021-05-06') AS date_start, DATE('2021-05-10') AS date_end, "user-1" AS user, "blue" AS color
UNION ALL
SELECT DATE('2021-05-06') AS date_start, DATE('2021-05-10') AS date_end, "user-2" AS user, "red" AS color
),
input_2 AS (
SELECT DATE('2021-05-03') AS date_start, DATE('2021-05-07') AS date_end, "user-1" AS user, "apple" AS fruit
UNION ALL
SELECT DATE('2021-05-08') AS date_start, DATE('2021-05-11') AS date_end, "user-1" AS user, "cherry" AS fruit
UNION ALL
SELECT DATE('2021-05-03') AS date_start, DATE('2021-05-11') AS date_end, "user-2" AS user, "banana" AS fruit
),
-------------------
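-- Explode each interval into one row per day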
input_1_day_by_day AS (
SELECT day, input_1.* EXCEPT(date_start, date_end)
FROM input_1
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(date_start, date_end)) AS day
),
input_2_day_by_day AS (
SELECT day, input_2.* EXCEPT(date_start, date_end)
FROM input_2
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(date_start, date_end)) AS day
)
--------------------
SELECT
-- User
COALESCE(i1.user,i2.user) AS user,
-- Date interval
MIN(COALESCE(i1.day, i2.day)) AS date_start,
MAX(COALESCE(i1.day, i2.day)) AS date_end,
-- Data
i1.color,
i2.fruit
FROM input_1_day_by_day AS i1
FULL JOIN input_2_day_by_day AS i2 ON i1.user = i2.user AND i1.day = i2.day
GROUP BY 1,4,5
ORDER BY 1,2
Basically: I expand each input into one row per day with GENERATE_DATE_ARRAY, FULL JOIN the two day-by-day tables on user and day, and then group back to rebuild the start/end date intervals.
The result is this one:
Now, while the solution seems to work, it is only practical while the tables are small samples like these: one metadata column per table, 2 tables.
My actual scenario involves many more tables, and each table has multiple related columns. Let's say, for example, 10 tables and about 20 columns per table.
My question is:
in order to create the MIN/MAX date intervals, I'm currently grouping by each metadata field I have:
GROUP BY 1,4,5
If I had the number of tables/columns I mentioned, the resulting grouping would look like
GROUP BY 1,4,5,6,7,8,9,......,30,31,32,......40,41,42,...
Is there a more intelligent way to achieve this kind of result without grouping by each and every column?
Some sort of notation like GROUP BY ALL EXCEPT 2,3,
or a different kind of grouping that could help in this scenario?
Here is another kind of input, with some additional columns:
input_1 AS (
SELECT DATE('2021-05-01') AS date_start, DATE('2021-05-05') AS date_end, "user-1" AS user, "white" AS color, "new-york" as city, "america" as country
UNION ALL
SELECT DATE('2021-05-06') AS date_start, DATE('2021-05-10') AS date_end, "user-1" AS user, "blue" AS color, "paris" as city, "france" as country
),
input_2 AS (
SELECT DATE('2021-05-03') AS date_start, DATE('2021-05-07') AS date_end, "user-1" AS user, "apple" AS fruit, "dog" as animal, "daisy" AS flower, "iron" AS metal
UNION ALL
SELECT DATE('2021-05-08') AS date_start, DATE('2021-05-09') AS date_end, "user-1" AS user, "cherry" AS fruit, "dog" as animal, "rose" as flower, "steel" as metal
UNION ALL
SELECT DATE('2021-05-10') AS date_start, DATE('2021-05-11') AS date_end, "user-1" AS user, "cherry" AS fruit, "dog" as animal, "rose" as flower, "iron" as metal
),
Upvotes: 1
Views: 66
Reputation: 173190
Is there a more intelligent way to achieve this kind of result w/o grouping each and single column?
Consider the generic solution below:
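-- collapse each uninterrupted run of days with identical metadata (same grp) back into a single start/end interval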
select user,
min(day) date_start,
max(day) date_end,
any_value(t).* except(user, day, grp)
from (
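-- grp: running count of changes; it increases whenever flag is true (metadata changed) or null (first day of a user)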
select * except(flag),
countif(ifnull(flag, true)) over(partition by user order by day) grp
from (
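-- flag: true when the concatenated fingerprints of the current day differ from the previous day for the same user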
select * except(mask1, mask2),
lag(ifnull(mask1, 'null') || ifnull(mask2, 'null')) over(partition by user order by day) != ifnull(mask1, 'null') || ifnull(mask2, 'null') as flag
from (
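-- expand input_1 to one row per day; mask1 is a JSON fingerprint of all its metadata columns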
select user, day, t.* except(user, date_start, date_end),
to_json_string((select as struct * except(user, date_start, date_end) from unnest([t]))) mask1
from input_1 t,
unnest(generate_date_array(date_start, date_end)) day
)
full outer join (
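-- same day-by-day expansion for input_2, fingerprinted as mask2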
select user, day, t.* except(user, date_start, date_end),
to_json_string((select as struct * except(user, date_start, date_end) from unnest([t]))) mask2
from input_2 t,
unnest(generate_date_array(date_start, date_end)) day
)
using(user, day)
)
) t
group by t.user, t.grp
# order by t.user, date_start
If applied to the sample data in your question, the output is:
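For what it's worth, the same pattern should also scale to more than two sources: each extra table gets its own day-by-day subquery and JSON fingerprint, one more full outer join ... using(user, day), and its mask concatenated into the lag() comparison. A rough sketch, assuming a hypothetical input_3 with the same (date_start, date_end, user, ...) shape:
select user,
  min(day) date_start,
  max(day) date_end,
  any_value(t).* except(user, day, grp)
from (
  select * except(flag),
    countif(ifnull(flag, true)) over(partition by user order by day) grp
  from (
    select * except(mask1, mask2, mask3),
      lag(ifnull(mask1, 'null') || ifnull(mask2, 'null') || ifnull(mask3, 'null')) over(partition by user order by day)
        != ifnull(mask1, 'null') || ifnull(mask2, 'null') || ifnull(mask3, 'null') as flag
    from (
      select user, day, t.* except(user, date_start, date_end),
        to_json_string((select as struct * except(user, date_start, date_end) from unnest([t]))) mask1
      from input_1 t, unnest(generate_date_array(date_start, date_end)) day
    )
    full outer join (
      select user, day, t.* except(user, date_start, date_end),
        to_json_string((select as struct * except(user, date_start, date_end) from unnest([t]))) mask2
      from input_2 t, unnest(generate_date_array(date_start, date_end)) day
    )
    using(user, day)
    full outer join (
      -- input_3 is hypothetical here; repeat this block once per extra source
      select user, day, t.* except(user, date_start, date_end),
        to_json_string((select as struct * except(user, date_start, date_end) from unnest([t]))) mask3
      from input_3 t, unnest(generate_date_array(date_start, date_end)) day
    )
    using(user, day)
  )
) t
group by t.user, t.grp
Note that the metadata columns never have to be listed anywhere: the select ... except(user, date_start, date_end) projections and the to_json_string(...) fingerprints pick them up automatically, and the final group by t.user, t.grp stays the same no matter how many columns are merged.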
Upvotes: 1