Reputation: 651
The following is the table I am having,
City date count
Seattle 2016-07-14 10
Seattle 2016-07-15 20
Seattle 2016-07-16 30
Seattle 2016-07-18 40
Seattle 2016-07-19 50
Seattle 2016-07-20 60
Seattle 2016-07-25 70
Seattle 2016-07-26 80
Bellevue 2016-07-21 90
Bellevue 2016-07-22 100
Bellevue 2016-07-23 110
Bellevue 2016-07-25 120
Bellevue 2016-07-26 130
Bellevue 2016-07-27 140
Bellevue 2016-08-10 150
Bellevue 2016-08-11 160
Bellevue 2016-08-12 170
I want to summarize this table into date intervals where every row will contain each interval of date. Whenever there is a break in the days, I want to create another row. My sample output should be as follows,
City min_date max_date sum_count
Seattle 2016-07-14 2016-07-16 60
Seattle 2016-07-18 2016-07-20 150
Seattle 2016-07-25 2016-07-26 150
Bellevue 2016-07-21 2016-07-23 300
Bellevue 2016-07-25 2016-07-27 390
Bellevue 2016-08-10 2016-08-12 480
Here if we can see, whenever there is a break in the dates, a new entry is created and the count is summed across. I want to create a entry whenever there is a break in the date.
I tried,
select city, min(date), max(date) , sum(count) from table
group by city
but that gives only two rows here.
Can anybody help me in doing this in Hive?
Upvotes: 1
Views: 408
Reputation: 1269853
This is a "gaps-and-islands" problem. The difference of row number from the date works:
select city, min(date), max(date), sum(count)
from (select t.*,
row_number() over (partition by city order by date) as seqnum
from t
) t
group by city, date_sub(date, seqnum);
Upvotes: 1