Keithx

Reputation: 3148

Joining Pandas dataframes based on time intervals and calculating average values

I have a tricky question.

There are two dataframes. The first, 'TimeRanges', holds time ranges (a start date and an end date, each with an ID), like this:

ID  StartTime       EndTime
1   01.03.18 12:00  01.03.18 13:00 
2   01.03.18 13:00  01.03.18 13:15 
3   01.03.18 13:30  01.03.18 14:55 

The second dataframe contains a Time column, with timestamps at one-minute frequency, and a Value column, like this:

Time            Value
01.03.18 12:00  5.00
01.03.18 12:01  20.00
01.03.18 12:02  5.00
01.03.18 13:10  30.00
01.03.18 14:20  45.00

What I want to achieve, and don't know how to approach, is to create a new column AvgValue in the TimeRanges dataframe containing the mean() of the Value entries whose Time falls in the interval between StartTime and EndTime. For example:

ID  StartTime       EndTime         AvgValue
1   01.03.18 12:00  01.03.18 13:00     10
2   01.03.18 13:00  01.03.18 13:15     30
3   01.03.18 13:30  01.03.18 14:55     45

*The value 10 is the mean of the values whose times fall in the interval from 01.03.18 12:00 to 01.03.18 13:00 (01.03.18 12:00, 01.03.18 12:01, 01.03.18 12:02); the mean is calculated only over these values.

What would be the approach to do that? Lambda functions, or something else?

Thanks

Upvotes: 1

Views: 369

Answers (2)

sammywemmy

Reputation: 28644

One option is conditional_join from pyjanitor (here df1 is TimeRanges and df2 is the dataframe with Time and Value):

# pip install pyjanitor
import pandas as pd
import janitor

(df2
.conditional_join(
    df1, 
    ('Time', 'StartTime', '>='), 
    ('Time', 'EndTime', '<='))
.groupby(['ID', 'StartTime', 'EndTime'], as_index = False)
.Value
.mean()
)

   ID           StartTime             EndTime  Value
0   1 2018-01-03 12:00:00 2018-01-03 13:00:00   10.0
1   2 2018-01-03 13:00:00 2018-01-03 13:15:00   30.0
2   3 2018-01-03 13:30:00 2018-01-03 14:55:00   45.0
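
If installing pyjanitor is not an option, roughly the same result can be sketched in plain pandas with a cross join followed by a filter. This is a swapped-in technique, not what conditional_join does internally, and it builds every Time/range pair in memory, so it only suits small inputs. The sample frames are copied from the question; the month-first date format is an assumption chosen to match the output above, and the names pairs, in_range and avg are only illustrative.

```python
import pandas as pd

# Assumption: dates are month-first, matching the 2018-01-03 output shown above.
FMT = "%m.%d.%y %H:%M"

# TimeRanges from the question
df1 = pd.DataFrame({
    "ID": [1, 2, 3],
    "StartTime": pd.to_datetime(
        ["01.03.18 12:00", "01.03.18 13:00", "01.03.18 13:30"], format=FMT),
    "EndTime": pd.to_datetime(
        ["01.03.18 13:00", "01.03.18 13:15", "01.03.18 14:55"], format=FMT),
})

# Time/Value dataframe from the question
df2 = pd.DataFrame({
    "Time": pd.to_datetime(
        ["01.03.18 12:00", "01.03.18 12:01", "01.03.18 12:02",
         "01.03.18 13:10", "01.03.18 14:20"], format=FMT),
    "Value": [5.0, 20.0, 5.0, 30.0, 45.0],
})

# Pair every measurement with every range (requires pandas >= 1.2 for how="cross")
pairs = df2.merge(df1, how="cross")

# Keep only the pairs where Time lies inside [StartTime, EndTime], both ends inclusive
in_range = pairs[(pairs["Time"] >= pairs["StartTime"])
                 & (pairs["Time"] <= pairs["EndTime"])]

# Average the values per range
avg = (in_range
       .groupby(["ID", "StartTime", "EndTime"], as_index=False)["Value"]
       .mean())
print(avg)
```

Ranges that contain no measurements drop out of this result, because no pair survives the filter for them.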

Upvotes: 0

gyx-hh

Reputation: 1431

I achieved this with resample, but it also requires a bit of fiddling, so it might not be the best solution. First of all, the index of the second dataframe (df here) needs to be of type DatetimeIndex, TimedeltaIndex or PeriodIndex.

# set Time to be index
df.set_index('Time', inplace=True)
# change index type to datetime
df.index = pd.to_datetime(df.index)

Using resample, I have used a rule of 60 minutes. The available rules are listed in the pandas resample documentation.

# '60min' is equivalent to '60T'; the 'T' alias is deprecated in recent pandas versions
new_df = df.resample('60min').mean().reset_index()

Now we have a new_df with the average value for every 60 minutes. We just need to add the end time to have it in the format you want.

from datetime import timedelta    
new_df['EndTime'] = new_df['Time'] + timedelta(seconds=3600)

and finally renaming the columns:

new_df.rename(columns={'Time': 'StartTime', 'Value': 'AvgValue'}, inplace=True)

Output:

    StartTime             AvgValue    EndTime
0   2018-01-03 12:00:00   10.0        2018-01-03 13:00:00
1   2018-01-03 13:00:00   30.0        2018-01-03 14:00:00
2   2018-01-03 14:00:00   45.0        2018-01-03 15:00:00
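
The resample steps above can be put together as one runnable sketch on the question's sample data. The month-first date format is an assumption chosen to match the output shown, and pd.Timedelta is used in place of datetime.timedelta only to avoid a second import.

```python
import pandas as pd

# Assumption: dates are month-first, matching the 2018-01-03 output shown above.
FMT = "%m.%d.%y %H:%M"

# Time/Value dataframe from the question, indexed by Time (a DatetimeIndex)
df = pd.DataFrame({
    "Time": pd.to_datetime(
        ["01.03.18 12:00", "01.03.18 12:01", "01.03.18 12:02",
         "01.03.18 13:10", "01.03.18 14:20"], format=FMT),
    "Value": [5.0, 20.0, 5.0, 30.0, 45.0],
}).set_index("Time")

# Mean per fixed 60-minute bin; each bin is labelled by its start time
new_df = df.resample("60min").mean().reset_index()

# Each bin ends 60 minutes after it starts
new_df["EndTime"] = new_df["Time"] + pd.Timedelta(minutes=60)

new_df = new_df.rename(columns={"Time": "StartTime", "Value": "AvgValue"})
print(new_df)
```

Note that the bins are fixed hourly windows, not the intervals from TimeRanges. The averages coincide on this sample only because of where the values happen to fall: a measurement at 13:20, for example, would land in the 13:00 bin even though it lies outside every range in the question.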

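EDIT: This time using the first dataframe (df1) for the time ranges, you can do the following. It assumes df is still indexed by Time (sorted) and that StartTime and EndTime in df1 are datetimes:

# for each range, slice df by label (both ends inclusive) and average the Value column
df1['AvgValue'] = df1.apply(lambda x: df.loc[x['StartTime']:x['EndTime'], 'Value'].mean(), axis=1)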
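
A self-contained sketch of this row-wise lookup on the question's sample data, writing to the AvgValue column name the question asks for. The month-first date format is an assumption chosen to match the output above, and mean_in_range is a hypothetical helper name used only for readability.

```python
import pandas as pd

# Assumption: dates are month-first, matching the 2018-01-03 output shown above.
FMT = "%m.%d.%y %H:%M"

# TimeRanges from the question
df1 = pd.DataFrame({
    "ID": [1, 2, 3],
    "StartTime": pd.to_datetime(
        ["01.03.18 12:00", "01.03.18 13:00", "01.03.18 13:30"], format=FMT),
    "EndTime": pd.to_datetime(
        ["01.03.18 13:00", "01.03.18 13:15", "01.03.18 14:55"], format=FMT),
})

# Time/Value dataframe, indexed by Time and sorted so that label slicing works
df = pd.DataFrame({
    "Time": pd.to_datetime(
        ["01.03.18 12:00", "01.03.18 12:01", "01.03.18 12:02",
         "01.03.18 13:10", "01.03.18 14:20"], format=FMT),
    "Value": [5.0, 20.0, 5.0, 30.0, 45.0],
}).set_index("Time").sort_index()

def mean_in_range(row):
    # Label slicing on a sorted DatetimeIndex includes both endpoints
    return df.loc[row["StartTime"]:row["EndTime"], "Value"].mean()

df1["AvgValue"] = df1.apply(mean_in_range, axis=1)
print(df1)
```

Unlike the resample version, this uses the exact intervals from df1. A range with no measurements gets NaN rather than being dropped.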

Upvotes: 1
