Reputation: 1985
I have a table as such in Hive: dt (timestamp), event_id(string)
I'm looking for all events that happened within (say) one day of each other for the same session. Assume that a single session can span over multiple days.
session dt event_id
1 2016-06-10 00:07:33 u38443jdnksdjoeoewk
2 2016-06-11 11:11:41 u39839jdijewenkfjij
1 2016-06-15 13:07:42 u38443jdjdksdnlqpma
2 2016-06-12 05:08:07 u38443jmcmsdjwewekh
In this case the resultant set would be
2016-06-11 11:11:41 u39839jdijewenkfjij
2016-06-12 05:08:07 u38443jmcmsdjwewekh
as the other rows' timestamps were beyond a day. This looks to be doable via a self join but that would be very expensive. Is there a better way?
Upvotes: 2
Views: 2421
Reputation: 8259
Your best method to get all events within a day of any given even is to use a self join. Computationally, it would not be that expensive. I tested it out on a table with 7000 records and when matching at plus or minus 60 minutes it took 3 seconds and returned 176,588 results. I used DATEDIFF for comparison or you can use DATEADD and compare to both limits. Here are both examples using your table:
SELECT h1.session, h1.event_id, h2.session, h2.event_id, h1.dt, h2.dt
FROM Hive h1
JOIN Hive h2 ON DATEDIFF(hh, h1.dt, h2.dt) < 24
SELECT h1.session, h1.event_id, h2.session, h2.event_id, h1.dt, h2.dt
FROM Hive h1
JOIN Hive h2 ON DATEADD(d,1,h1.dt) >= h2.dt
AND DATEADD(d,-1,h1.dt) <= h2.dt
Upvotes: 2
Reputation: 49260
Use lag
and lead
to get the previous and next row's dates and subtract it with the current row's date. Check to see if either of them is within one day of the current row's date.
select session,dt,event_id
from (
select dt,event_id,session,
unix_timestamp(dt)-unix_timestamp(lag(dt) over(partition by session order by dt)) prev_diff,
unix_timestamp(lead(dt) over(partition by session order by dt))-unix_timestamp(dt) next_diff
from t
) x
where prev_diff <= 86400 or next_diff <= 86400
Upvotes: 1