lollerskates
lollerskates

Reputation: 1147

How to join two dataframes with overlapping time windows and matching IDs in pandas

Suppose I have the following dataframe that keeps track of when test times start and end:

import pandas as pd
from datetime import datetime
dfA = pd.DataFframe({'test_id': [1,2], 
                    'start_time': [datetime.strptime("2019-06-01 04:00:00", "%Y-%m-%d %H:%M:%S")
                                 , datetime.strptime("2019-06-03 13:12:00", "%Y-%m-%d %H:%M:%S")],
                     'end_time': [datetime.strptime("2019-06-01 06:00:00", "%Y-%m-%d %H:%M:%S")
                                 , datetime.strptime("2019-06-03 15:29:00", "%Y-%m-%d %H:%M:%S")]})
dfB = pd.DataFframe({'test_id': [1,3], 
                    'start_time': [datetime.strptime("2019-06-01 02:00:00", "%Y-%m-%d %H:%M:%S")
                                 , datetime.strptime("2019-06-01 00:00:00", "%Y-%m-%d %H:%M:%S")],
                     'end_time': [datetime.strptime("2019-06-01 05:00:00", "%Y-%m-%d %H:%M:%S")
                                 , datetime.strptime("2019-06-01 02:00:00", "%Y-%m-%d %H:%M:%S")]})

I would like to perform the SQL equivalent

select * from A
inner join B
on (A.start_time between B.start_time AND B.start_time
   OR A.end_time between B.start_time AND B.start_time
   OR B.start_time between A.start_time AND A.start_time
   OR B.end_time between A.start_time AND A.start_time)
  AND A.id = B.id

in Pandas. From this post, I learned that pandas does not support this type of join and I will have to use numpy.where like so:

# get the start and end times for both dataframes
Astart_time = dfA.start_time.values # a
Aend_time = dfA.end_time.values # b

Bstart_time = dfB.start_time.values # c
Bend_time = dfB.end_time.values # d

# We need to JOIN both pandas dataframe where there are overlapping
# timeframes. We check for these overlaps:
# (c <= a < d) OR (c <= b < d) OR (a <= c < b) OR (a <= d < b)
# sql equivalent of a INNER JOIN ON BETWEEN a range of values
A_records, B_records = np.where(((Astart_time[:, None] >= Bstart_time) & (Astart_time[:, None] < Bend_time))\
                              | ((Aend_time[:, None] >= Bstart_time) & (Aend_time[:, None] < Bend_time))\
                              | ((Astart_time[:, None] <= Bstart_time) & (Astart_time[:, None] > Bend_time))\
                              | ((Astart_time[:, None] <= Bend_time) & (Aend_time[:, None] > Bend_time)))

However I cannot figure out how to add the condition that A.test_id == B.test_id in the numpy where clause. I am expecting only the records where test_id == 1 to join from both dataframes A and B. The reason I want to add this extra condition into the np.where clause is because my dataframes will contain several million records each and I do not want them to blow up my machine's memory when performing the join.

Upvotes: 1

Views: 176

Answers (1)

Quang Hoang
Quang Hoang

Reputation: 150785

There's a query for that:

(dfA.merge(dfB, on='test_id', suffixes=['_a', '_b'])
   .query('start_time_b <= start_time_a <= end_time_b | ' +
          'start_time_b <= end_time_a <= end_time_b   | ' +
          'start_time_a <= start_time_b <= end_time_a | ' +
          'start_time_a <= end_time_b <= end_time_a'
   )
)

Upvotes: 1

Related Questions