Ron
Ron

Reputation: 15

Min and Max dates from the given set of records in Hive

All, I have a requirement to get Min and Max date in Hive for set of Records. My data would look something like this.

Key    start_date    end_date
----   ----------    ---------      
111    01-01-2017    06-30-2017
111    07-01-2017    07-31-2017
111    09-01-2017    09-30-2017
111    10-01-2017    10-20-2017
111    11-01-2017    11-30-2017

Output what i am expecting is

key   start_date    end_date
---   ---------     --------
111   01-01-2017    07-31-2017
111   09-01-2017    10-20-2017
111   11-01-2017    11-30-2017

Basically, whenever there is a break in date range, i need to write that as a new record.

I tried to get the date difference between the end and start dates and if it is greater than 1, i am tagging that as a new record.But, i am not sure how to split the records after that. Any help or guidance would be highly appreciated.

Expected Result

key   start_date    end_date
111   01-01-2017    07-31-2017
111   09-01-2017    10-20-2017
111   11-01-2017    11-30-2017

Upvotes: 1

Views: 569

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1269583

This is a type of gaps-and-islands problem. You need to determine where "islands" exist. Assuming that the time periods are adjacent (as in your example), you can use join or exists:

select key, min(start_date), max(end_date)
from (select t.*,
             sum(case when tprev.key is null then 1 else 0 end) over (partition by t.key order by t.start_date) as grp
      from t left join
           t tprev
           on tprev.key = t.key and
              tprev.end_date = date_add(t.start_date, -1)
     ) t
group by key, grp;

This logic can be adjusted to handle overlaps, but that does not seem to be needed for your data.

Upvotes: 1

Related Questions