MatBailie
MatBailie

Reputation: 86808

Windowed Average, accounting for gaps

I need to calculate an average over the preceding 4 weeks...

SELECT
  *,
  AVG(val) OVER (PARTITION BY some_identifier, day_of_week_column
                     ORDER BY date_column
                  ROW BETWEEN 4 PRECEDING AND 1 PRECEDING
                )
                   AS preceding_4_week_average
FROM
  myTable

The data, however, is "sparse"

In this case my window function should look back "4 weeks" rather than "4 rows".
- A missing date is not a 0 it's implicitly a NULL

 thing | date       | dow | val | avg
   1   | 2018-01-01 |  1  |  1  | NULL  <= AVG({})
   1   | 2018-01-08 |  1  |  2  | 1     <= AVG({1})
   1   | 2018-01-15 |  1  |  3  | 1.5   <= AVG({1,2})
   1   | 2018-01-22 |  1  |  4  | 2     <= AVG({1,2,3})
   1   | 2018-01-29 |  1  |  5  | 2.5   <= AVG({1,2,3,4})

   1   | 2018-02-12 |  1  |  7  | 4     <= AVG({3,4,5})
   1   | 2018-02-19 |  1  |  8  | 5.33  <= AVG({4,5,7})
   1   | 2018-02-26 |  1  |  9  | 6.66  <= AVG({5,7,8})
   1   | 2018-03-05 |  1  |  10 | 8     <= AVG({7,8,9})
   1   | 2018-03-12 |  1  |  11 | 11.25 <= AVG({7,8,9,10})
   1   | 2018-03-19 |  1  |  12 | 9.5   <= AVG({8,9,10,11})

Note : There is no value for 2018-02-05

I would normally approach it in one of two ways...


  1. LEFT JOIN on to a template to "force" all the dates to exist, and rely on AVG() effectively "ignoring" NULLs.

This is less than ideal as the number of "things" is huge and constructing this template is expensive.

SELECT
  *,
  AVG(mytable.val) OVER (PARTITION BY things.id, dates.dow
                             ORDER BY dates.date
                          ROW BETWEEN 4 PRECEDING AND 1 PRECEDING
                        )
                           AS preceding_4_week_average
FROM
  things
CROSS JOIN
  dates
LEFT JOIN
  myTable
    ON  myTable.date = dates.date
    AND myTable.id   = things.id


  1. Don't use window functions, use a self join instead

This is less than ideal as there are hundreds of columns in myTable and BigQuery isn't performing very well on that.

SELECT
  myTable.*,
  AVG(hist.val)   AS preceding_4_week_average
FROM
  myTable
LEFT JOIN
  myTable   AS hist
    ON  hist.id    = myTable.id
    AND hist.date >= myTable.date - INTERVAL 28 DAYS
    AND hist.date <  myTable.date
GROUP BY
  myTable.column1,
  myTable.column2,
  etc, etc


The actual question

Does anyone else have an alternative, preferably using windowed/analytic functions to "look back 4 weeks" rather than "look back 4 rows"?

Upvotes: 1

Views: 552

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173210

Below is for BigQuery Standard SQL

As you will see - the trick is in using RANGE instead of ROW

#standardSQL
SELECT *,
  AVG(val) OVER(
    PARTITION BY id, dow 
    ORDER BY DATE_DIFF(DATE_TRUNC(date, WEEK), DATE_TRUNC(CURRENT_DATE(), WEEK), WEEK) 
    RANGE BETWEEN 4 PRECEDING AND 1 PRECEDING
  ) AVG
FROM `project.dataset.table`   

You can test, play with above using dummy data from your question as below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, DATE '2018-01-01' date, 1 dow, 1 val UNION ALL
  SELECT 1, '2018-01-08', 1, 2  UNION ALL
  SELECT 1, '2018-01-15', 1, 3  UNION ALL
  SELECT 1, '2018-01-22', 1, 4  UNION ALL
  SELECT 1, '2018-01-29', 1, 5  UNION ALL
  SELECT 1, '2018-02-12', 1, 7  UNION ALL
  SELECT 1, '2018-02-19', 1, 8  UNION ALL
  SELECT 1, '2018-02-26', 1, 9  UNION ALL
  SELECT 1, '2018-03-05', 1, 10 UNION ALL
  SELECT 1, '2018-03-12', 1, 11 UNION ALL
  SELECT 1, '2018-03-19', 1, 12  
)
SELECT *,
  AVG(val) OVER(
    PARTITION BY id, dow 
    ORDER BY DATE_DIFF(DATE_TRUNC(date, WEEK), DATE_TRUNC(CURRENT_DATE(), WEEK), WEEK) 
    RANGE BETWEEN 4 PRECEDING AND 1 PRECEDING
  ) avg
FROM `project.dataset.table`
-- ORDER BY date

with result as

Row id      date    dow val avg 
1   1   2018-01-01  1   1   null     
2   1   2018-01-08  1   2   1.0  
3   1   2018-01-15  1   3   1.5  
4   1   2018-01-22  1   4   2.0  
5   1   2018-01-29  1   5   2.5  
6   1   2018-02-12  1   7   4.0  
7   1   2018-02-19  1   8   5.333333333333333    
8   1   2018-02-26  1   9   6.666666666666667    
9   1   2018-03-05  1   10  8.0  
10  1   2018-03-12  1   11  8.5  
11  1   2018-03-19  1   12  9.5  

Upvotes: 3

Gordon Linoff
Gordon Linoff

Reputation: 1271241

This is brute force, but it should be faster:

select t.*,
       ((case when date_1 >= date_add(date, interval -4 week)
              then val_1 else 0
         end) +
        (case when date_2 >= date_add(date, interval -4 week)
              then val_2 else 0
         end) +
        (case when date_3 >= date_add(date, interval -4 week)
              then val_3 else 0
         end) +
        (case when date_4 >= date_add(date, interval -4 week)
              then val_4 else 0
         end)
        ) /
       ((case when date_1 >= date_add(date, interval -4 week)
              then 1 else 0
         end) +
        (case when date_2 >= date_add(date, interval -4 week)
              then 1 else 0
         end) +
        (case when date_3 >= date_add(date, interval -4 week)
              then 1 else 0
         end) +
        (case when date_4 >= date_add(date, interval -4 week)
              then 1 else 0
         end)
        )        
from (select t.*,
             lag(val, 1) over (partition by id, dow order by date) as val_1,
             lag(val, 2) over (partition by id, dow order by date) as val_2,
             lag(val, 3) over (partition by id, dow order by date) as val_3,
             lag(val, 4) over (partition by id, dow order by date) as val_4,
             lag(date, 1) over (partition by id, dow order by date) as date_1,
             lag(date, 2) over (partition by id, dow order by date) as date_2,
             lag(date, 3) over (partition by id, dow order by date) as date_3,
             lag(date, 4) over (partition by id, dow order by date) as date_4
   from mytable t
     ) t;

There is probably a clever way to express this using arrays, but it's a bit earlier where I am.

Upvotes: 0

Related Questions