Reputation: 47
Given a table as follows :
client_id date connections
---------------------------------------
121438297 2018-01-03 0
121438297 2018-01-08 1
121438297 2018-01-10 3
121438297 2018-01-12 1
121438297 2018-01-19 7
363863811 2018-01-18 0
363863811 2018-01-30 5
363863811 2018-02-01 4
363863811 2018-02-10 0
I am looking for an efficient way to sum the number of connections that occur within 6 days following the current row (the current row being included in the sum), partitioned by client_id, which would result in :
client_id date connections connections_within_6_days
---------------------------------------------------------------------
121438297 2018-01-03 0 1
121438297 2018-01-08 1 5
121438297 2018-01-10 3 4
121438297 2018-01-12 1 1
121438297 2018-01-19 7 7
363863811 2018-01-18 0 0
363863811 2018-01-30 5 9
363863811 2018-02-01 4 4
363863811 2018-02-10 0 0
Issues :
I do not want to add all missing dates and then perform a sliding window counting the 7 following rows because my table is already extremely large.
I am using Impala and the range between interval '7' days following and current row
is not supported.
Edit : I am looking for a generic answer taking into account the fact that I will need to change the window size to larger numbers (30+ days for example)
Upvotes: 1
Views: 562
Reputation: 1269623
This answers the original version of the question.
Impala doesn't fully support range between
. Unfortunately, that doesn't leave many options. One is to use lag()
with lots of explicit logic:
select t.*,
( (case when lag(date, 6) over (partition by client_id order by date) = date - interval 6 day
then lag(connections, 6) over (partition by client_id order by date)
else 0
end) +
(case when lag(date, 5) over (partition by client_id order by date) = date - interval 6 day
then lag(connections, 5) over (partition by client_id order by date)
else 0
end) +
(case when lag(date, 4) over (partition by client_id order by date) = date - interval 6 day
then lag(connections, 4) over (partition by client_id order by date)
else 0
end) +
(case when lag(date, 3) over (partition by client_id order by date) = date - interval 6 day
then lag(connections, 3) over (partition by client_id order by date)
else 0
end) +
(case when lag(date, 2) over (partition by client_id order by date) = date - interval 6 day
then lag(connections, 2) over (partition by client_id order by date)
else 0
end) +
(case when lag(date, 1) over (partition by client_id order by date) = date - interval 6 day
then lag(connections, 1) over (partition by client_id order by date)
else 0
end) +
connections
) as connections_within_6_days
from t;
Unfortunately, this doesn't generalize very well. If you want a wide range of days, you might want to ask another question.
Upvotes: 1