iskandarblue

Reputation: 7526

Joining dataframes based on overlapping timespans with aggregation

I have two dataframes: df_intervals, containing time spans between two timestamps, and df_events, containing single timestamps that represent events:

    df_events, df_intervals

    (  Var2                  ts
     0  bar 2021-02-10 09:04:31
     1  bar 2021-01-29 05:56:17
     2  bar 2021-01-16 15:59:43
     3  bar 2021-01-25 09:40:40
     4  bar 2021-01-27 16:44:57
     5  bar 2021-01-17 13:28:43
     6  bar 2021-02-03 11:46:10
     7  bar 2021-02-02 11:16:49
     8  bar 2021-01-21 17:12:15
     9  bar 2021-01-19 03:44:30,
       Var1            start_ts              end_ts
     0  foo 2021-02-01 20:29:57 2021-02-02 20:29:57
     1  foo 2021-02-03 20:29:57 2021-02-04 20:29:57
     2  foo 2021-02-04 20:29:57 2021-02-05 20:29:57
     3  foo 2021-02-05 20:29:57 2021-02-06 20:29:57
     4  foo 2021-02-06 20:29:57 2021-02-07 20:29:57
     5  foo 2021-02-07 20:29:57 2021-02-08 20:29:57
     6  foo 2021-02-08 20:29:57 2021-02-11 20:29:57
     7  foo 2021-02-08 20:29:57 2021-02-10 20:29:57
     8  foo 2021-02-10 20:29:57 2021-02-11 20:29:57)

My objective is to left join df_intervals to df_events wherever the event timestamp ts falls between start_ts and end_ts. The join should produce a boolean match column and a count column holding the number of matching intervals. The difficulty is handling multiple matches in df_intervals: if there is at least one match, match should be True. I have tried pd.merge_asof(), but I am confused about how its direction= parameter applies to this kind of between query. Any suggestions on how to perform this kind of conditional join using pandas or numpy (without resorting to sqlite)?

    df_output
    Var2         ts  match  count
     bar 1612947871   True      2
     bar 1611899777  False      0
     bar 1610812783  False      0
     bar 1611567640  False      0
     bar 1611765897  False      0
     bar 1610890123  False      0
     bar 1612352770  False      0
     bar 1612264609   True      1
     bar 1611249135  False      0
     bar 1611027870  False      0

Note: I have been able to perform this matching with a nested for loop, but that requires collecting the per-interval booleans into a matrix and then checking it for True values. I am looking for a simpler, vectorized solution.

for event in df_events['ts']:
    # one boolean per interval: does this interval contain the event?
    matches = [start <= event <= end
               for _, _, start, end in df_intervals.itertuples()]
    print(event, sum(matches), any(matches))
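The nested loop can be collapsed into a single numpy broadcast comparison; a sketch using the question's sample data (raw epoch seconds compare identically to the converted datetimes):

```python
import numpy as np
import pandas as pd

# Sample data from the question (epoch seconds)
intervals = [['foo', 1612211397, 1612297797], ['foo', 1612384197, 1612470597],
             ['foo', 1612470597, 1612556997], ['foo', 1612556997, 1612643397],
             ['foo', 1612643397, 1612729797], ['foo', 1612729797, 1612816197],
             ['foo', 1612816197, 1613075397], ['foo', 1612816197, 1612988997],
             ['foo', 1612988997, 1613075397]]
df_intervals = pd.DataFrame(intervals, columns=['Var1', 'start_ts', 'end_ts'])
ts = [1612947871, 1611899777, 1610812783, 1611567640, 1611765897,
      1610890123, 1612352770, 1612264609, 1611249135, 1611027870]
df_events = pd.DataFrame({'Var2': ['bar'] * 10, 'ts': ts})

# Shape (n_events, n_intervals): one containment test per event/interval pair
events = df_events['ts'].to_numpy()[:, None]
hits = (events >= df_intervals['start_ts'].to_numpy()) \
     & (events <= df_intervals['end_ts'].to_numpy())

df_events['count'] = hits.sum(axis=1)  # how many intervals contain each event
df_events['match'] = hits.any(axis=1)  # True if at least one interval matches
```

This builds the full events × intervals boolean matrix in one step, so memory grows with the product of the two lengths, which is fine at this scale.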

Code to generate the dataframes:

import pandas as pd

intervals = [['foo', 1612211397, 1612297797], ['foo', 1612384197, 1612470597],
             ['foo', 1612470597, 1612556997], ['foo', 1612556997, 1612643397],
             ['foo', 1612643397, 1612729797], ['foo', 1612729797, 1612816197],
             ['foo', 1612816197, 1613075397], ['foo', 1612816197, 1612988997],
             ['foo', 1612988997, 1613075397]]

df_intervals = pd.DataFrame(intervals, columns=['Var1', 'start_ts', 'end_ts'])
df_intervals['start_ts'] = pd.to_datetime(df_intervals['start_ts'], unit='s')
df_intervals['end_ts'] = pd.to_datetime(df_intervals['end_ts'], unit='s')

ts = [1612947871, 1611899777, 1610812783, 1611567640, 1611765897,
      1610890123, 1612352770, 1612264609, 1611249135, 1611027870]

df_events = pd.DataFrame({'Var2': ['bar'] * 10, 'ts': ts})
df_events['ts'] = pd.to_datetime(df_events['ts'], unit='s')

Upvotes: 2

Views: 56

Answers (2)

dh_vie

Reputation: 41

This will produce the output as in your example:

df_events['count'] = df_events.apply(
    lambda r: len(df_intervals[(df_intervals['start_ts'] <= r['ts']) &
                               (df_intervals['end_ts'] >= r['ts'])]),
    axis=1)
df_events['match'] = df_events['count'] > 0

Upvotes: 1
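For larger frames, the row-wise apply above can become slow. An alternative sketch, assuming pandas 1.2+ for `how='cross'`, cross joins the two frames and aggregates back to one row per event (shown here on a small subset of the question's data):

```python
import pandas as pd

# Small subset of the question's sample data (epoch seconds)
df_intervals = pd.DataFrame({
    'Var1': ['foo'] * 3,
    'start_ts': pd.to_datetime([1612211397, 1612816197, 1612988997], unit='s'),
    'end_ts':   pd.to_datetime([1612297797, 1613075397, 1613075397], unit='s'),
})
df_events = pd.DataFrame({
    'Var2': ['bar'] * 3,
    'ts': pd.to_datetime([1612947871, 1611899777, 1612264609], unit='s'),
})

# Pair every event with every interval, then test containment per pair
merged = df_events.merge(df_intervals, how='cross')
merged['hit'] = (merged['ts'] >= merged['start_ts']) & (merged['ts'] <= merged['end_ts'])

# Collapse back to one row per event: count of hits and an any-hit flag
out = (merged.groupby(['Var2', 'ts'], sort=False)['hit']
             .agg(['sum', 'any'])
             .reset_index()
             .rename(columns={'sum': 'count', 'any': 'match'}))
```

The cross join materializes every event/interval pair, so this trades memory for avoiding Python-level per-row calls.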

Rick M

Reputation: 1012

You can use .apply() to get the count column (and match is True if count != 0):

df_events['count'] = df_events.apply(
    lambda x: ((x['ts'] >= df_intervals['start_ts']) &
               (x['ts'] <= df_intervals['end_ts'])).sum(),
    axis=1)

Upvotes: 0
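Another vectorized option, sketched here on a subset of the question's sample data, is `pd.IntervalIndex`, whose `contains()` tests one timestamp against every interval at once:

```python
import pandas as pd

df_intervals = pd.DataFrame({
    'start_ts': pd.to_datetime([1612211397, 1612816197, 1612988997], unit='s'),
    'end_ts':   pd.to_datetime([1612297797, 1613075397, 1613075397], unit='s'),
})
df_events = pd.DataFrame({
    'ts': pd.to_datetime([1612947871, 1611899777, 1612264609], unit='s'),
})

# closed='both' matches the inclusive start_ts <= ts <= end_ts check
iv = pd.IntervalIndex.from_arrays(df_intervals['start_ts'],
                                  df_intervals['end_ts'], closed='both')

# contains(scalar) returns one boolean per interval
df_events['count'] = [int(iv.contains(t).sum()) for t in df_events['ts']]
df_events['match'] = df_events['count'] > 0
```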
