Sum column values over a window based on variable date range (impala)

Question

Given a table as follows :

client_id   date            connections
---------------------------------------
121438297   2018-01-03      0
121438297   2018-01-08      1
121438297   2018-01-10      3
121438297   2018-01-12      1
121438297   2018-01-19      7
363863811   2018-01-18      0
363863811   2018-01-30      5
363863811   2018-02-01      4
363863811   2018-02-10      0

I am looking for an efficient way to sum the number of connections that occur within x number of days following the current row (the current row being included in the sum), partitioned by client_id.

If x=6 then the output table would result in :

client_id   date            connections     connections_within_6_days
---------------------------------------------------------------------
121438297   2018-01-03      0               1        
121438297   2018-01-08      1               5     
121438297   2018-01-10      3               4     
121438297   2018-01-12      1               1                       
121438297   2018-01-19      7               7
363863811   2018-01-18      0               0
363863811   2018-01-30      5               9
363863811   2018-02-01      4               4
363863811   2018-02-10      0               0

Concerns :

I do not want to add all missing dates and then perform a sliding window counting the x following rows because my table is already extremely large.
I am using Impala and the range between interval 'x' days following and current row is not supported.

Gordon Linoff · Accepted Answer

The generic solution is a bit troublesome for multiple periods, but you can use multiple CTEs to support that. The idea is to "unpivot" the counts based on when they go in and out and then use a cumulative sum.

So:

with conn as (
      select client_id, date, connections
      from t
      union all
      select client_id, date + interval 7 day, -connections
      from t
     ),
     conn1 as (
      select client_id, date,
             sum(sum(connections)) over (partition by client_id order by date) as connections_within_6_days
      from t
      group by client_id, date
     )
select t.*, conn1. connections_within_6_days
from t join
     conn1
     on conn1.client_id = t.client_id and
        conn1.date = t.date;

Sum column values over a window based on variable date range (impala)

Answers (1)

Related Questions