nicholas
nicholas

Reputation: 47

Sum column values over a window based on a week range (impala)

Given a table as follows :

client_id   date            connections
---------------------------------------
121438297   2018-01-03      0
121438297   2018-01-08      1
121438297   2018-01-10      3
121438297   2018-01-12      1
121438297   2018-01-19      7
363863811   2018-01-18      0
363863811   2018-01-30      5
363863811   2018-02-01      4
363863811   2018-02-10      0

I am looking for an efficient way to sum the number of connections that occur within 6 days following the current row (the current row being included in the sum), partitioned by client_id, which would result in :

client_id   date            connections     connections_within_6_days
---------------------------------------------------------------------
121438297   2018-01-03      0               1        
121438297   2018-01-08      1               5     
121438297   2018-01-10      3               4     
121438297   2018-01-12      1               1                       
121438297   2018-01-19      7               7
363863811   2018-01-18      0               0
363863811   2018-01-30      5               9
363863811   2018-02-01      4               4
363863811   2018-02-10      0               0

Issues :

  1. I do not want to add all missing dates and then perform a sliding window counting the 7 following rows because my table is already extremely large.

  2. I am using Impala and the range between interval '7' days following and current row is not supported.


Edit : I am looking for a generic answer taking into account the fact that I will need to change the window size to larger numbers (30+ days for example)

Upvotes: 1

Views: 562

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1269623

This answers the original version of the question.

Impala doesn't fully support range between. Unfortunately, that doesn't leave many options. One is to use lag() with lots of explicit logic:

select t.*,
       ( (case when lag(date, 6) over (partition by client_id order by date) = date - interval 6 day
               then lag(connections, 6) over (partition by client_id order by date)
               else 0
          end) +
         (case when lag(date, 5) over (partition by client_id order by date) = date - interval 6 day
               then lag(connections, 5) over (partition by client_id order by date)
               else 0
          end) +
         (case when lag(date, 4) over (partition by client_id order by date) = date - interval 6 day
               then lag(connections, 4) over (partition by client_id order by date)
               else 0
          end) +
         (case when lag(date, 3) over (partition by client_id order by date) = date - interval 6 day
               then lag(connections, 3) over (partition by client_id order by date)
               else 0
          end) +
         (case when lag(date, 2) over (partition by client_id order by date) = date - interval 6 day
               then lag(connections, 2) over (partition by client_id order by date)
               else 0
          end) +
         (case when lag(date, 1) over (partition by client_id order by date) = date - interval 6 day
               then lag(connections, 1) over (partition by client_id order by date)
               else 0
          end) +
         connections
        ) as connections_within_6_days         
from t;

Unfortunately, this doesn't generalize very well. If you want a wide range of days, you might want to ask another question.

Upvotes: 1

Related Questions