Reputation: 894
I've seen solutions for problems similar to mine, but none quite work for me. Also I'm confident that there should be a way to make it work.
Given a table with
ID | Date | target |
---|---|---|
1 | 2020-01-01 | 1 |
1 | 2020-01-02 | 1 |
1 | 2020-01-03 | 0 |
1 | 2020-01-04 | 1 |
1 | 2020-01-04 | 0 |
1 | 2020-06-01 | 1 |
1 | 2020-06-02 | 1 |
1 | 2020-06-03 | 0 |
1 | 2020-06-04 | 1 |
1 | 2020-06-04 | 0 |
2 | 2020-01-01 | 1 |
ID is BIGINT, target is Int and Date is DATE
I want to compute, for each ID/Date, the sum and the number of rows for the same ID in the 3 months and 12 months before the Date (inclusive). Example of output:
ID | Date | Sum_3 | Count_3 | Sum_12 | Count_12 |
---|---|---|---|---|---|
1 | 2020-01-01 | 1 | 1 | 1 | 1 |
1 | 2020-01-02 | 2 | 2 | 2 | 2 |
1 | 2020-01-03 | 2 | 3 | 2 | 3 |
1 | 2020-01-04 | 3 | 5 | 3 | 5 |
1 | 2020-06-01 | 1 | 1 | 4 | 6 |
1 | 2020-06-02 | 2 | 2 | 5 | 7 |
1 | 2020-06-03 | 2 | 3 | 6 | 8 |
1 | 2020-06-04 | 3 | 5 | 7 | 10 |
2 | 2020-01-01 | 1 | 1 | 1 | 1 |
How can I get this time of results in HIVE? I'm not sure if I should use analytical functions (and how), group by, etc...?
Upvotes: 0
Views: 566
Reputation: 222462
If you can live with an approximation of months as a number of days, then you can use window functions in Hive:
select id, date,
count(*) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 90 preceding -- 90 days
) as count_3,
sum(target) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 90 preceding
) as sum_3,
count(*) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 360 preceding -- 360 days
) as count_12,
sum(target) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 360 preceding
) as sum_12
from mytable
You can aggregate in the same query:
select id, date,
sum(count(*)) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 90 preceding -- 90 days
) as count_3,
sum(sum(target)) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 90 preceding
) as sum_3,
sum(count(*)) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 360 preceding -- 360 days
) as count_12,
sum(sum(target)) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 360 preceding
) as sum_12
from mytable
group by id, date, unix_timestamp(date)
Upvotes: 2
Reputation: 42352
If you can do an estimation of interval (1 month = 30 days): (an improvement of GMB's answer)
with t as (
select ID, Date,
sum(target) target,
count(target) c_target
from table
group by ID, Date
)
select ID, Date,
sum(target) over(
partition by ID
order by unix_timestamp(Date, 'yyyy-MM-dd')
range 60 * 60 * 24 * 90 preceding
) sum_3,
sum(c_target) over(
partition by ID
order by unix_timestamp(Date, 'yyyy-MM-dd')
range 60 * 60 * 24 * 90 preceding
) count_3,
sum(target) over(
partition by ID
order by unix_timestamp(Date, 'yyyy-MM-dd')
range 60 * 60 * 24 * 360 preceding
) sum_12,
sum(c_target) over(
partition by ID
order by unix_timestamp(Date, 'yyyy-MM-dd')
range 60 * 60 * 24 * 360 preceding
) count_12
from t
Or if you want exact intervals, you can do self joins (but expensive):
with t as (
select ID, Date,
sum(target) target,
count(target) c_target
from table
group by ID, Date
)
select
t_3month.ID,
t_3month.Date,
t_3month.sum_3,
t_3month.count_3,
sum(t3.target) sum_12,
sum(t3.c_target) count_12
from (
select
t1.ID,
t1.Date,
sum(t2.target) sum_3,
sum(t2.c_target) count_3
from t t1
left join t t2
on t2.Date > t1.Date - interval 3 month and
t2.Date <= t1.Date and
t1.ID = t2.ID
group by t1.ID, t1.Date
) t_3month
left join t t3
on t3.Date > t_3month.Date - interval 12 month and
t3.Date <= t_3month.Date and
t_3month.ID = t3.ID
group by t_3month.ID, t_3month.Date, t_3month.sum_3, t_3month.count_3
order by ID, Date;
Upvotes: 1