robroc

Reputation: 1212

pandas: join dataframes based on time interval

I have a data frame with a datetime column every 10 minutes and a numerical value:

df1 = pd.DataFrame({'time' : pd.date_range('1/1/2018', periods=20, freq='10min'), 'value' : np.random.randint(2, 20, size=20)})

I have another with a schedule of events, each with a start time and an end time. Multiple events can happen at the same time:

df2 = pd.DataFrame({'start_time' : ['2018-01-01 00:00:00', '2018-01-01 00:00:00','2018-01-01 01:00:00', '2018-01-01 01:00:00', '2018-01-01 01:00:00', '2018-01-01 02:00:00' ], 'end_time' : ['2018-01-01 01:00:00', '2018-01-01 01:00:00', '2018-01-01 02:00:00','2018-01-01 02:00:00', '2018-01-01 02:00:00', '2018-01-01 03:00:00'], 'event' : ['A', 'B', 'C', 'D', 'E', 'F'] })
df2[['start_time', 'end_time']] = df2.iloc[:,0:2].apply(pd.to_datetime)

I want to do a left join on df1, attaching to each row every event whose interval (start_time to end_time) contains that row's time. My output table should be:

                  time  value event
0  2018-01-01 00:00:00      5     A
1  2018-01-01 00:00:00      5     B
2  2018-01-01 00:10:00     15     A
3  2018-01-01 00:10:00     15     B
4  2018-01-01 00:20:00     16     A
5  2018-01-01 00:20:00     16     B
.....
17 2018-01-01 02:50:00      7     F

I attempted these SO solutions, but they fail because of duplicate time intervals.

Upvotes: 4

Views: 3900

Answers (4)

sammywemmy

Reputation: 28659

One option is conditional_join from pyjanitor:

# pip install pyjanitor
import pandas as pd
import janitor

out = df1.conditional_join(
    df2,
    ('time', 'start_time', '>='),
    ('time', 'end_time', '<='),
)

out.head()
                 time  value start_time            end_time event
0 2018-01-01 00:00:00     14 2018-01-01 2018-01-01 01:00:00     A
1 2018-01-01 00:00:00     14 2018-01-01 2018-01-01 01:00:00     B
2 2018-01-01 00:10:00     10 2018-01-01 2018-01-01 01:00:00     A
3 2018-01-01 00:10:00     10 2018-01-01 2018-01-01 01:00:00     B
4 2018-01-01 00:20:00     15 2018-01-01 2018-01-01 01:00:00     A
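
If you would rather not add a dependency, the same two conditions can be sketched with NumPy broadcasting. This is only a rough equivalent under my own assumptions, not how pyjanitor works internally: the fixed `value` column (`np.arange`) and the names `mask`, `i` and `j` are made up for the example. It builds a len(df1) x len(df2) boolean matrix, so memory grows with the product of the two lengths.

```python
import numpy as np
import pandas as pd

# Frames from the question; 'value' is fixed here (assumption) so the run is reproducible.
df1 = pd.DataFrame({
    'time': pd.date_range('2018-01-01', periods=20, freq='10min'),
    'value': np.arange(20),
})
df2 = pd.DataFrame({
    'start_time': pd.to_datetime(['2018-01-01 00:00', '2018-01-01 00:00', '2018-01-01 01:00',
                                  '2018-01-01 01:00', '2018-01-01 01:00', '2018-01-01 02:00']),
    'end_time': pd.to_datetime(['2018-01-01 01:00', '2018-01-01 01:00', '2018-01-01 02:00',
                                '2018-01-01 02:00', '2018-01-01 02:00', '2018-01-01 03:00']),
    'event': list('ABCDEF'),
})

# Compare every df1 time against every df2 interval (inclusive on both ends).
t = df1['time'].to_numpy()[:, None]            # shape (len(df1), 1)
start = df2['start_time'].to_numpy()           # shape (len(df2),)
end = df2['end_time'].to_numpy()
mask = (t >= start) & (t <= end)               # shape (len(df1), len(df2))

# Row/column positions of every matching (df1 row, df2 row) pair.
i, j = np.nonzero(mask)

out = pd.concat(
    [df1.iloc[i].reset_index(drop=True), df2.iloc[j].reset_index(drop=True)],
    axis=1,
)
```

Like the call above, this behaves as an inner join: a time that falls in no interval (03:10 here) is dropped.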

Upvotes: 1

user3483203

Reputation: 51165

Setup (Only using a few entries from df1 for brevity):

df1 = pd.DataFrame({'time' : pd.date_range('1/1/2018', periods=20, freq='10min'), 'value' : np.random.randint(2, 20, size=20)})
df2 = pd.DataFrame({'start_time' : ['2018-01-01 00:00:00', '2018-01-01 00:00:00','2018-01-01 01:00:00', '2018-01-01 01:00:00', '2018-01-01 01:00:00', '2018-01-01 02:00:00' ], 'end_time' : ['2018-01-01 01:00:00', '2018-01-01 01:00:00', '2018-01-01 02:00:00','2018-01-01 02:00:00', '2018-01-01 02:00:00', '2018-01-01 03:00:00'], 'event' : ['A', 'B', 'C', 'D', 'E', 'F'] })

df1 = df1.sample(5)
df2[['start_time', 'end_time']] = df2.iloc[:,0:2].apply(pd.to_datetime)

You can use a couple of straightforward list comprehensions to achieve your result. This answer assumes that all date columns in your DataFrame are in fact of type datetime:

Step 1:
Find all events that occur within a particular time range using a list comprehension and simple interval checking:

packed = list(zip(df2.start_time, df2.end_time, df2.event))
df1['event'] = [[ev for strt, end, ev in packed if strt <= el <= end] for el in df1.time]

                  time  value      event
2  2018-01-01 00:20:00      8     [A, B]
14 2018-01-01 02:20:00     14        [F]
8  2018-01-01 01:20:00      6  [C, D, E]
19 2018-01-01 03:10:00     16         []
4  2018-01-01 00:40:00      7     [A, B]

Step 2:

Finally, explode each list from the last result to a new row using another list comprehension:

pd.DataFrame(
    [[t, val, e]
     for t, val, event in zip(df1.time, df1.value, df1.event)
     for e in event],
    columns=df1.columns
)

Output:

                 time  value event
0 2018-01-01 00:20:00      8     A
1 2018-01-01 00:20:00      8     B
2 2018-01-01 02:20:00     14     F
3 2018-01-01 01:20:00      6     C
4 2018-01-01 01:20:00      6     D
5 2018-01-01 01:20:00      6     E
6 2018-01-01 00:40:00      7     A
7 2018-01-01 00:40:00      7     B
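
On newer pandas versions (0.25 and later, as far as I know), Step 2 can probably be replaced by `DataFrame.explode`. The sketch below is a variation on the approach above, not part of the original answer; the three-row `df1` and the names `exploded` and `inner` are assumptions made for the example. Note that `explode` turns an empty list into a NaN row, which gives left-join behaviour; drop those rows if you want the same result as the list comprehension.

```python
import pandas as pd

# A small hand-picked df1 (assumption) so the result is easy to check by eye.
df1 = pd.DataFrame({
    'time': pd.to_datetime(['2018-01-01 00:20', '2018-01-01 02:20', '2018-01-01 03:10']),
    'value': [8, 14, 16],
})
df2 = pd.DataFrame({
    'start_time': pd.to_datetime(['2018-01-01 00:00', '2018-01-01 00:00', '2018-01-01 01:00',
                                  '2018-01-01 01:00', '2018-01-01 01:00', '2018-01-01 02:00']),
    'end_time': pd.to_datetime(['2018-01-01 01:00', '2018-01-01 01:00', '2018-01-01 02:00',
                                '2018-01-01 02:00', '2018-01-01 02:00', '2018-01-01 03:00']),
    'event': list('ABCDEF'),
})

# Step 1, unchanged: collect the matching events for each time as a list.
packed = list(zip(df2.start_time, df2.end_time, df2.event))
df1['event'] = [[ev for strt, end, ev in packed if strt <= el <= end] for el in df1.time]

# Step 2 with explode: one row per list element; empty lists become NaN.
exploded = df1.explode('event')

# Drop the unmatched rows to mirror the list-comprehension version.
inner = exploded.dropna(subset=['event']).reset_index(drop=True)
```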

Upvotes: 5

Ben.T

Reputation: 29635

You can reshape df2 so that each event gets one row per 10-minute step between its start and end time (the same frequency as df1), and then use merge. It's a lot of manipulation, so it's probably not the most efficient.

df2_manip = (df2.set_index('event').stack().reset_index().set_index(0)
                .groupby('event').resample('10T').ffill().reset_index(1))

and df2_manip looks like:

                        0 event     level_1
event                                      
A     2018-01-01 00:00:00     A  start_time
A     2018-01-01 00:10:00     A  start_time
A     2018-01-01 00:20:00     A  start_time
A     2018-01-01 00:30:00     A  start_time
A     2018-01-01 00:40:00     A  start_time
A     2018-01-01 00:50:00     A  start_time
A     2018-01-01 01:00:00     A    end_time
B     2018-01-01 00:00:00     B  start_time
B     2018-01-01 00:10:00     B  start_time
B     2018-01-01 00:20:00     B  start_time
B     2018-01-01 00:30:00     B  start_time
...

Now you can merge:

df1 = df1.merge(df2_manip[[0, 'event']].rename(columns={0:'time'}))

and you get df1:

                  time  value event
0  2018-01-01 00:00:00      9     A
1  2018-01-01 00:00:00      9     B
2  2018-01-01 00:10:00     16     A
3  2018-01-01 00:10:00     16     B
...
33 2018-01-01 02:00:00      6     D
34 2018-01-01 02:00:00      6     E
35 2018-01-01 02:00:00      6     F
36 2018-01-01 02:10:00      2     F
37 2018-01-01 02:20:00     18     F
38 2018-01-01 02:30:00     14     F
39 2018-01-01 02:40:00      5     F
40 2018-01-01 02:50:00      3     F
41 2018-01-01 03:00:00      9     F
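
The same expansion idea can also be sketched with `pd.date_range` per event, which may be easier to follow than the stack/resample chain. This is an alternative under my own assumptions, not the code above: the name `expanded`, the fixed `value` column and the explicit `how='left'` (to keep df1 rows with no event, as the question asks) are mine. Like the resampling version, it only works when df1's times sit on the same 10-minute grid as the event boundaries.

```python
import numpy as np
import pandas as pd

# Frames from the question; 'value' is fixed here (assumption) so the run is reproducible.
df1 = pd.DataFrame({
    'time': pd.date_range('2018-01-01', periods=20, freq='10min'),
    'value': np.arange(20),
})
df2 = pd.DataFrame({
    'start_time': pd.to_datetime(['2018-01-01 00:00', '2018-01-01 00:00', '2018-01-01 01:00',
                                  '2018-01-01 01:00', '2018-01-01 01:00', '2018-01-01 02:00']),
    'end_time': pd.to_datetime(['2018-01-01 01:00', '2018-01-01 01:00', '2018-01-01 02:00',
                                '2018-01-01 02:00', '2018-01-01 02:00', '2018-01-01 03:00']),
    'event': list('ABCDEF'),
})

# One row per (10-minute tick, event), covering start_time..end_time inclusive.
expanded = pd.DataFrame(
    [(t, ev)
     for start, end, ev in zip(df2.start_time, df2.end_time, df2.event)
     for t in pd.date_range(start, end, freq='10min')],
    columns=['time', 'event'],
)

# Left join keeps df1 rows that match no event (their 'event' is NaN).
out = df1.merge(expanded, on='time', how='left')
```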

Upvotes: 0

AZhao

Reputation: 14405

I'm not entirely sure of your question, but if you are trying to join on "events that fall inside the start and end times," then it sounds like you need something akin to the "between" operator from SQL. Your data doesn't make it particularly clear.

Pandas doesn't have this natively, but pandasql does. It lets you run SQLite queries against your dataframes. I think something like this is what you need:

import pandasql as ps

sqlcode = '''
-- keep every df1 row; attach each event whose interval contains df1.time
select df1.time, df1.value, df2.event
from df1
left join df2
  on df1.time >= df2.start_time and df1.time <= df2.end_time
'''

newdf = ps.sqldf(sqlcode,locals())

Relevant Question: Merge pandas dataframes where one value is between two others
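
If installing pandasql is not an option, a "between"-style join can also be approximated in plain pandas with a cross merge followed by a filter. This is just a sketch under a few assumptions: it needs pandas 1.2 or later for `how='cross'`, the fixed `value` column and the names `pairs` and `inside` are made up for the example, and it materialises every df1 x df2 pair, so it only suits small frames.

```python
import numpy as np
import pandas as pd

# Frames from the question; 'value' is fixed here (assumption) so the run is reproducible.
df1 = pd.DataFrame({
    'time': pd.date_range('2018-01-01', periods=20, freq='10min'),
    'value': np.arange(20),
})
df2 = pd.DataFrame({
    'start_time': pd.to_datetime(['2018-01-01 00:00', '2018-01-01 00:00', '2018-01-01 01:00',
                                  '2018-01-01 01:00', '2018-01-01 01:00', '2018-01-01 02:00']),
    'end_time': pd.to_datetime(['2018-01-01 01:00', '2018-01-01 01:00', '2018-01-01 02:00',
                                '2018-01-01 02:00', '2018-01-01 02:00', '2018-01-01 03:00']),
    'event': list('ABCDEF'),
})

# Every combination of a df1 row with a df2 row (20 * 6 = 120 rows).
pairs = df1.merge(df2, how='cross')

# Keep the pairs where time lies in [start_time, end_time]; between() is inclusive by default.
inside = pairs[pairs['time'].between(pairs['start_time'], pairs['end_time'])]

out = inside[['time', 'value', 'event']].reset_index(drop=True)
```

This gives inner-join semantics; to get a left join, merge `out` back onto df1 with `how='left'`.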

Upvotes: 1
