Gross
Gross

Reputation: 165

Google BigQuery SQL: How to fill in gaps in a table with dates?

Data Table:

Dates Customer Subscription
20/02/2020 A free
21/02/2020 A free
22/02/2020 A free
23/02/2020 B free
23/03/2020 A full
01/03/2020 B full
01/03/2020 A full
02/03/2020 A full

Need to fill gaps in dates by the value in the previous date

Output:

Dates Customer Last Subscription
20/02/2020 A free
21/02/2020 A free
22/02/2020 A free
23/03/2020 A full
23/03/2020 B free
24/02/2020 A full
24/02/2020 B free
25/02/2020 A full
25/02/2020 B free
26/02/2020 A full
26/02/2020 B free
27/02/2020 A full
27/02/2020 B free
28/02/2020 A full
28/02/2020 B free
01/03/2020 A full
01/03/2020 B full
02/03/2020 A full
02/03/2020 B full

I found a similar solution Duplicate groups of records to fill multiple date gaps in Google BigQuery, but it is not suitable because in my example each Customer has a different start date.

Upvotes: 3

Views: 2209

Answers (3)

Andrey Vityuk
Andrey Vityuk

Reputation: 1019

BigQuery recently introduced GAP_FILL function that can be used to do this (see GAP_FILL BigQuery documentation).

The query will look like this:

SELECT * FROM GAP_FILL(
  TABLE data,  -- input table to gap fill
  'dates',  -- time column name
  INTERVAL 1 DAY,
  partitioning_columns => ['customer'],
  value_columns => [
    ('subscription', 'locf')  -- fill with the last observation
  ]
);

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269443

I strongly recommend approach this using lead() like this:

with data as (
      select cast('2020-02-20' as date) as dates, 'A' as customer, 'free' as subscription union all
      select cast('2020-02-21' as date) as dates, 'A' as customer, 'free' as subscription union all
      select cast('2020-02-22' as date) as dates, 'A' as customer, 'free' as subscription union all
      select cast('2020-02-23' as date) as dates, 'B' as customer, 'free' as subscription union all
      select cast('2020-03-23' as date) as dates, 'A' as customer, 'full' as subscription union all
      select cast('2020-03-01' as date) as dates, 'B' as customer, 'full' as subscription union all
      select cast('2020-03-01' as date) as dates, 'A' as customer, 'full' as subscription union all
      select cast('2020-03-02' as date) as dates, 'A' as customer, 'full' as subscription 
)
select d.customer, dy, d.subscription
from (select d.*,
             lead(dates) over (partition by customer order by dates) as next_date
      from data d
     ) d cross join
     unnest(generate_date_array(d.dates, coalesce(date_add(d.next_date, interval -1 day), d.dates), interval 1 day)) dy;

The reason that I recommend this is because the unnest() occurs within a single row, so the is no data movement for matching dates. In addition no window function is needed to fill in subscription.

Upvotes: 1

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172944

Consider below

with temp as (
  select customer, dates from (    
    select customer, min(dates) min_date, max(dates) max_date
    from `project.dataset.table`
    group by customer
  ), unnest(generate_date_array(min_date, max_date)) dates
)
select customer, dates, 
  first_value(subscription ignore nulls) over win as subscription
from temp a
left join `project.dataset.table` b
using(customer, dates)
window win as (partition by customer order by dates desc rows between current row and unbounded following)
# order by dates, customer          

If to apply to sample data in y our question - output is

enter image description here

Upvotes: 3

Related Questions