How to add records for each user based on another existing row in BigQuery?

Posting here in case someone with more knowledge than me may be able to help with some direction.

I have a table like this:

| Row |   date   |user id | score |
-----------------------------------
|  1  | 20201120 |  1     |   26  |
-----------------------------------
|  2  | 20201121 |  1     |   14  |
-----------------------------------
|  3  | 20201125 |  1     |   0   |
-----------------------------------
|  4  | 20201114 |  2     |   32  |
-----------------------------------
|  5  | 20201116 |  2     |   0   |
-----------------------------------
|  6  | 20201120 |  2     |   23  |
-----------------------------------

From this, I need one record per user per day. If a day is missing for a user, the last recorded score should be carried forward, so I would end up with something like this:

| Row |   date   |user id | score |
-----------------------------------
|  1  | 20201120 |  1     |   26  |
-----------------------------------
|  2  | 20201121 |  1     |   14  |
-----------------------------------
|  3  | 20201122 |  1     |   14  |
-----------------------------------
|  4  | 20201123 |  1     |   14  |
-----------------------------------
|  5  | 20201124 |  1     |   14  |
-----------------------------------
|  6  | 20201125 |  1     |   0   |
-----------------------------------
|  7  | 20201114 |  2     |   32  |
-----------------------------------
|  8  | 20201115 |  2     |   32  |
-----------------------------------
|  9  | 20201116 |  2     |   0   |
-----------------------------------
|  10 | 20201117 |  2     |   0   |
-----------------------------------
|  11 | 20201118 |  2     |   0   |
-----------------------------------
|  12 | 20201119 |  2     |   0   |
-----------------------------------
|  13 | 20201120 |  2     |   23  |
-----------------------------------

I'm trying to do this in BigQuery using Standard SQL. I have an idea of how to keep the same score across the following empty dates, but I really don't know how to add new rows for the missing dates for each user. Also, keep in mind that this example only has 2 users, but my data has more than 1500.

My end goal is to show something like the average score per day. For background: in our logic, if no score was recorded on a given day, the user still has the last recorded score, which is why I need a score for every user on every day.

I'd really appreciate any help I could get! I've been trying different options without success.

Upvotes: 0

Views: 114

Answers (3)

Gordon Linoff

Reputation: 1271003

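I think the most efficient method is to use generate_date_array(), but in a very particular way: expand each row from its own date up to the day before the same user's next record, so the score is carried along without a join. This assumes date is a DATE column: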

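-- `mytable` is a placeholder for the source table; `date` is assumed to be a DATE
with t as (
      select s.*,
             -- last day this score applies: the day before the user's next record
             date_add(lead(date) over (partition by user_id order by date), interval -1 day) as next_date
      from mytable s
     )
select row_number() over (order by t.user_id, dte) as id,
       t.user_id, dte, t.score
from t cross join
     unnest(generate_date_array(date,
                                coalesce(next_date, date),
                                interval 1 day
                               )
           ) dte;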
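
To try the range-expansion idea without creating a table, here is a minimal sketch with the question's sample rows inlined. The CTE names `sample` and `ranges` and the column names `dt` and `last_day` are made up for illustration, and it assumes the dates arrive as 'YYYYMMDD' strings that are parsed into DATE values first.

```sql
with sample as (
  -- sample rows from the question; dates parsed from 'YYYYMMDD' strings (assumption)
  select parse_date('%Y%m%d', d) as dt, user_id, score
  from unnest([
    struct('20201120' as d, 1 as user_id, 26 as score),
    ('20201121', 1, 14),
    ('20201125', 1, 0),
    ('20201114', 2, 32),
    ('20201116', 2, 0),
    ('20201120', 2, 23)
  ])
),
ranges as (
  select s.*,
         -- a score stays valid until the day before the user's next record
         date_sub(lead(dt) over (partition by user_id order by dt), interval 1 day) as last_day
  from sample s
)
select format_date('%Y%m%d', dte) as date, user_id, score
from ranges
-- a user's latest record has no next row, so it expands to a single day
cross join unnest(generate_date_array(dt, coalesce(last_day, dt), interval 1 day)) as dte
order by user_id, dte;
```

Because each source row generates its own run of days, no self-join or window fill is needed. The trade-off is that each user's series stops at that user's last recorded date.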

Upvotes: 0

Mikhail Berlyant

Reputation: 173190

Below is for BigQuery Standard SQL

#standardSQL
select date, user_id, 
  last_value(score ignore nulls) over(partition by user_id order by date) as score
from (
  select user_id, format_date('%Y%m%d', day) date,  
  from (
    select user_id, min(parse_date('%Y%m%d', date)) min_date, max(parse_date('%Y%m%d', date)) max_date
    from `project.dataset.table` 
    group by user_id
  ) a, unnest(generate_date_array(min_date, max_date)) day
)
left join `project.dataset.table` b
using(date, user_id)
-- order by user_id, date     

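If applied to the sample data from your question (with date stored as a 'YYYYMMDD' string), the output has one row per user per day, with each missing day carrying forward the last recorded score.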

Upvotes: 1

GMB

Reputation: 222672

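One option uses generate_date_array() to create the series of dates between each user's first and last record, then brings in the recorded scores with a left join. Days without a match get a null score, which last_value(... ignore nulls) replaces with the most recent non-null score. This assumes date is a DATE column.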

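select d.date, d.user_id, 
    last_value(t.score ignore nulls) over(partition by d.user_id order by d.date) as score
from (
    -- one row per user per day between that user's first and last record
    select r.user_id, dt as date
    from (
        -- aggregate first: min()/max() cannot be used directly inside the unnest
        select user_id, min(date) as min_date, max(date) as max_date
        from mytable
        group by user_id
    ) r
    cross join unnest(generate_date_array(r.min_date, r.max_date, interval 1 day)) as dt
) d
left join mytable t on t.user_id = d.user_id and t.date = d.date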
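
For the end goal stated in the question, an average score per day, one possible follow-up is to aggregate the filled rows by date. The sketch below is an illustration under assumptions, not part of the original answer. The sample rows are inlined under the placeholder name `mytable`, the CTE names `calendar` and `filled` are made up, and the dates are parsed into DATE values. Every user's series is extended to the latest date in the whole table so that each user keeps contributing the last score. Whether that is the right range depends on your logic.

```sql
with mytable as (
  -- sample rows from the question, inlined so the query runs as-is
  select parse_date('%Y%m%d', d) as date, user_id, score
  from unnest([
    struct('20201120' as d, 1 as user_id, 26 as score),
    ('20201121', 1, 14), ('20201125', 1, 0),
    ('20201114', 2, 32), ('20201116', 2, 0), ('20201120', 2, 23)
  ])
),
calendar as (
  -- one row per user per day, from the user's first record to the table-wide last date
  select r.user_id, dt
  from (select user_id, min(date) as min_date from mytable group by user_id) r
  cross join (select max(date) as max_date from mytable) m
  cross join unnest(generate_date_array(r.min_date, m.max_date)) as dt
),
filled as (
  select c.user_id, c.dt,
         -- carry the most recent non-null score forward over the missing days
         last_value(t.score ignore nulls)
           over (partition by c.user_id order by c.dt) as score
  from calendar c
  left join mytable t on t.user_id = c.user_id and t.date = c.dt
)
select dt as date, avg(score) as avg_score, count(*) as users_counted
from filled
group by dt
order by dt;
```

The `users_counted` column is included only as a sanity check: days before a user's first record have no row for that user, so the average on those days covers fewer users.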

Upvotes: 0
