Observer
Observer

Reputation: 651

Summarizing the date, count to minimum and maximum date as per interval - SQL

The following is the table I am having,

City            date    count
Seattle     2016-07-14  10
Seattle     2016-07-15  20
Seattle     2016-07-16  30
Seattle     2016-07-18  40
Seattle     2016-07-19  50
Seattle     2016-07-20  60
Seattle     2016-07-25  70
Seattle     2016-07-26  80
Bellevue    2016-07-21  90
Bellevue    2016-07-22  100
Bellevue    2016-07-23  110
Bellevue    2016-07-25  120
Bellevue    2016-07-26  130
Bellevue    2016-07-27  140
Bellevue    2016-08-10  150
Bellevue    2016-08-11  160
Bellevue    2016-08-12  170

I want to summarize this table into date intervals where every row will contain each interval of date. Whenever there is a break in the days, I want to create another row. My sample output should be as follows,

City        min_date    max_date    sum_count
Seattle     2016-07-14  2016-07-16  60
Seattle     2016-07-18  2016-07-20  150
Seattle     2016-07-25  2016-07-26  150
Bellevue    2016-07-21  2016-07-23  300
Bellevue    2016-07-25  2016-07-27  390
Bellevue    2016-08-10  2016-08-12  480

Here if we can see, whenever there is a break in the dates, a new entry is created and the count is summed across. I want to create a entry whenever there is a break in the date.

I tried,

select city, min(date), max(date) , sum(count) from table 
group by city

but that gives only two rows here.

Can anybody help me in doing this in Hive?

Upvotes: 1

Views: 408

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1269853

This is a "gaps-and-islands" problem. The difference of row number from the date works:

select city, min(date), max(date), sum(count)
from (select t.*,
             row_number() over (partition by city order by date) as seqnum
      from t
     ) t
group by city, date_sub(date, seqnum);

Upvotes: 1

Related Questions