Reputation: 1147
Suppose I have the following dataframe that keeps track of when test times start and end:
import pandas as pd
from datetime import datetime
dfA = pd.DataFframe({'test_id': [1,2],
'start_time': [datetime.strptime("2019-06-01 04:00:00", "%Y-%m-%d %H:%M:%S")
, datetime.strptime("2019-06-03 13:12:00", "%Y-%m-%d %H:%M:%S")],
'end_time': [datetime.strptime("2019-06-01 06:00:00", "%Y-%m-%d %H:%M:%S")
, datetime.strptime("2019-06-03 15:29:00", "%Y-%m-%d %H:%M:%S")]})
dfB = pd.DataFframe({'test_id': [1,3],
'start_time': [datetime.strptime("2019-06-01 02:00:00", "%Y-%m-%d %H:%M:%S")
, datetime.strptime("2019-06-01 00:00:00", "%Y-%m-%d %H:%M:%S")],
'end_time': [datetime.strptime("2019-06-01 05:00:00", "%Y-%m-%d %H:%M:%S")
, datetime.strptime("2019-06-01 02:00:00", "%Y-%m-%d %H:%M:%S")]})
I would like to perform the SQL equivalent
select * from A
inner join B
on (A.start_time between B.start_time AND B.start_time
OR A.end_time between B.start_time AND B.start_time
OR B.start_time between A.start_time AND A.start_time
OR B.end_time between A.start_time AND A.start_time)
AND A.id = B.id
in Pandas. From this post, I learned that pandas does not support this type of join and I will have to use numpy.where
like so:
# get the start and end times for both dataframes
Astart_time = dfA.start_time.values # a
Aend_time = dfA.end_time.values # b
Bstart_time = dfB.start_time.values # c
Bend_time = dfB.end_time.values # d
# We need to JOIN both pandas dataframe where there are overlapping
# timeframes. We check for these overlaps:
# (c <= a < d) OR (c <= b < d) OR (a <= c < b) OR (a <= d < b)
# sql equivalent of a INNER JOIN ON BETWEEN a range of values
A_records, B_records = np.where(((Astart_time[:, None] >= Bstart_time) & (Astart_time[:, None] < Bend_time))\
| ((Aend_time[:, None] >= Bstart_time) & (Aend_time[:, None] < Bend_time))\
| ((Astart_time[:, None] <= Bstart_time) & (Astart_time[:, None] > Bend_time))\
| ((Astart_time[:, None] <= Bend_time) & (Aend_time[:, None] > Bend_time)))
However I cannot figure out how to add the condition that A.test_id == B.test_id
in the numpy where clause. I am expecting only the records where test_id == 1 to join from both dataframes A and B. The reason I want to add this extra condition into the np.where
clause is because my dataframes will contain several million records each and I do not want them to blow up my machine's memory when performing the join.
Upvotes: 1
Views: 176
Reputation: 150785
There's a query
for that:
(dfA.merge(dfB, on='test_id', suffixes=['_a', '_b'])
.query('start_time_b <= start_time_a <= end_time_b | ' +
'start_time_b <= end_time_a <= end_time_b | ' +
'start_time_a <= start_time_b <= end_time_a | ' +
'start_time_a <= end_time_b <= end_time_a'
)
)
Upvotes: 1