Reputation: 7526
I have two dataframes: df_intervals
containing time spans between two timestamps and df_events
containing single timestamps representing events
df_events, df_intervals
( Var2 ts
0 bar 2021-02-10 09:04:31
1 bar 2021-01-29 05:56:17
2 bar 2021-01-16 15:59:43
3 bar 2021-01-25 09:40:40
4 bar 2021-01-27 16:44:57
5 bar 2021-01-17 13:28:43
6 bar 2021-02-03 11:46:10
7 bar 2021-02-02 11:16:49
8 bar 2021-01-21 17:12:15
9 bar 2021-01-19 03:44:30,
Var1 start_ts end_ts
0 foo 2021-02-01 20:29:57 2021-02-02 20:29:57
1 foo 2021-02-03 20:29:57 2021-02-04 20:29:57
2 foo 2021-02-04 20:29:57 2021-02-05 20:29:57
3 foo 2021-02-05 20:29:57 2021-02-06 20:29:57
4 foo 2021-02-06 20:29:57 2021-02-07 20:29:57
5 foo 2021-02-07 20:29:57 2021-02-08 20:29:57
6 foo 2021-02-08 20:29:57 2021-02-11 20:29:57
7 foo 2021-02-08 20:29:57 2021-02-10 20:29:57
8 foo 2021-02-10 20:29:57 2021-02-11 20:29:57)
My objective is to left join df_intervals
to df_events
if there is a matching record in df_intervals
where the event timestamp ts
falls between the start_ts
and end_ts
. The resulting join would produce a match
boolean column and count
column with the number of matching intervals. The issue I am having is when there are multiple matches in df_intervals
- in this case, if there is at least one match, match
should be true. I have tried with pd.merge_asof()
but am confused about how to use the direction=
parameter with this kind of between query. Any suggestions on how to perform this kind of complex join using pandas or numpy (without using sqlite)?
df_output
Var2 ts match count
bar 1612947871 True 2
bar 1611899777 False 0
bar 1610812783 False 0
bar 1611567640 False 0
bar 1611765897 False 0
bar 1610890123 False 0
bar 1612352770 False 0
bar 1612264609 True 1
bar 1611249135 False 0
bar 1611027870 False 0
Note- I have been able to perform this matching in a for loop, but it requires putting output lists into a matrix and then checking if there are True/False values. I am looking for suggestions for a simpler solution
for event in df_events['ts']:
matches = []
for idx, a, b,c in df_intervals.itertuples():
print(idx,a,b,c)
if b <= event <= c:
matches.append(True)
else:
matches.append(False)
print(matches)
Code to generate the dataframes:
import pandas as pd
import random
intervals = [['foo', 1612211397, 1612297797], \
['foo', 1612384197, 1612470597], ['foo', 1612470597, 1612556997], \
['foo', 1612556997 , 1612643397], ['foo', 1612643397, 1612729797], ['foo', 1612729797, 1612816197], \
['foo', 1612816197, 1613075397], ['foo', 1612816197, 1612988997], ['foo', 1612988997, 1613075397]]
df_intervals = pd.DataFrame(intervals, columns = ['Var1', 'start_ts', 'end_ts'])
df_intervals['start_ts']= pd.to_datetime(df_intervals['start_ts'],unit='s')
df_intervals['end_ts']= pd.to_datetime(df_intervals['end_ts'],unit='s')
ts = [1612947871, 1611899777, 1610812783, 1611567640, 1611765897, 1610890123, 1612352770, 1612264609,1611249135,
1611027870]
bar = ['bar'] * 10
d = {'Var2': bar,'ts':ts}
df_events = pd.DataFrame(d)
df_events['ts']= pd.to_datetime(df_events['ts'],unit='s')
Upvotes: 2
Views: 56
Reputation: 41
This will provide the output as in your example
df_events['count'] = df_events.apply(lambda r: len(df_intervals.loc[(df_intervals['start_ts']<=r['ts']) & (df_intervals['end_ts']>=r['ts'])]), axis=1)
df_events['match'] = df_events['count'] > 0
Upvotes: 1
Reputation: 1012
You can use .apply()
to get the count
column (and match
is True if count != 0
):
df_events['count'] = df_events.apply(lambda x: sum( (x['ts']>=df_intervals['start_ts']) & (x['ts']<=df_intervals['end_ts'])), axis=1)
Upvotes: 0