Craig
Craig

Reputation: 1985

SQL Find all events that occurred within 1 day of each other

I have a table as such in Hive: dt (timestamp), event_id(string)

I'm looking for all events that happened within (say) one day of each other for the same session. Assume that a single session can span over multiple days.

   session         dt                    event_id
       1      2016-06-10 00:07:33     u38443jdnksdjoeoewk
       2      2016-06-11 11:11:41     u39839jdijewenkfjij
       1      2016-06-15 13:07:42     u38443jdjdksdnlqpma
       2      2016-06-12 05:08:07     u38443jmcmsdjwewekh

In this case the resultant set would be

2016-06-11 11:11:41     u39839jdijewenkfjij
2016-06-12 05:08:07     u38443jmcmsdjwewekh

as the other rows' timestamps were beyond a day. This looks to be doable via a self join but that would be very expensive. Is there a better way?

Upvotes: 2

Views: 2421

Answers (2)

Ben Hoffman
Ben Hoffman

Reputation: 8259

Your best method to get all events within a day of any given even is to use a self join. Computationally, it would not be that expensive. I tested it out on a table with 7000 records and when matching at plus or minus 60 minutes it took 3 seconds and returned 176,588 results. I used DATEDIFF for comparison or you can use DATEADD and compare to both limits. Here are both examples using your table:

SELECT h1.session, h1.event_id, h2.session, h2.event_id, h1.dt, h2.dt
FROM Hive h1
    JOIN Hive h2 ON DATEDIFF(hh, h1.dt, h2.dt) < 24


SELECT h1.session, h1.event_id, h2.session, h2.event_id, h1.dt, h2.dt
FROM Hive h1
    JOIN Hive h2 ON DATEADD(d,1,h1.dt) >= h2.dt
        AND DATEADD(d,-1,h1.dt) <= h2.dt

Upvotes: 2

Vamsi Prabhala
Vamsi Prabhala

Reputation: 49260

Use lag and lead to get the previous and next row's dates and subtract it with the current row's date. Check to see if either of them is within one day of the current row's date.

select session,dt,event_id 
from (
select dt,event_id,session,
unix_timestamp(dt)-unix_timestamp(lag(dt) over(partition by session order by dt)) prev_diff,
unix_timestamp(lead(dt) over(partition by session order by dt))-unix_timestamp(dt) next_diff
from t
) x
where prev_diff <= 86400 or next_diff <= 86400

Upvotes: 1

Related Questions