Reputation: 15
All, I have a requirement to get Min and Max date in Hive for set of Records. My data would look something like this.
Key start_date end_date
---- ---------- ---------
111 01-01-2017 06-30-2017
111 07-01-2017 07-31-2017
111 09-01-2017 09-30-2017
111 10-01-2017 10-20-2017
111 11-01-2017 11-30-2017
Output what i am expecting is
key start_date end_date
--- --------- --------
111 01-01-2017 07-31-2017
111 09-01-2017 10-20-2017
111 11-01-2017 11-30-2017
Basically, whenever there is a break in date range, i need to write that as a new record.
I tried to get the date difference between the end and start dates and if it is greater than 1, i am tagging that as a new record.But, i am not sure how to split the records after that. Any help or guidance would be highly appreciated.
Expected Result
key start_date end_date
111 01-01-2017 07-31-2017
111 09-01-2017 10-20-2017
111 11-01-2017 11-30-2017
Upvotes: 1
Views: 569
Reputation: 1269583
This is a type of gaps-and-islands problem. You need to determine where "islands" exist. Assuming that the time periods are adjacent (as in your example), you can use join
or exists
:
select key, min(start_date), max(end_date)
from (select t.*,
sum(case when tprev.key is null then 1 else 0 end) over (partition by t.key order by t.start_date) as grp
from t left join
t tprev
on tprev.key = t.key and
tprev.end_date = date_add(t.start_date, -1)
) t
group by key, grp;
This logic can be adjusted to handle overlaps, but that does not seem to be needed for your data.
Upvotes: 1