Reputation: 385
I have 2 DFs:
DF1
Name | Timestamp | Value |
---|---|---|
object 1 | 2021-11-01 10:00:00 | 1. |
object 1 | 2021-11-01 11:00:00 | 1.5 |
object 2 | 2021-11-01 10:30:00 | 1.7 |
DF2
Name | Timestamp | feature |
---|---|---|
object 1 | 2021-11-01 8:00:00 | 0.9 |
object 1 | 2021-11-01 9:00:00 | 1.1 |
object 1 | 2021-11-01 9:30:00 | 1.3 |
object 1 | 2021-11-01 12:00:00 | 1. |
object 2 | 2021-11-01 10:00:00 | 1.3 |
object 2 | 2021-11-01 11:30:00 | 1.9 |
Per row from DF1, I would like to get the rolling average of the last N rows from DF2 that have the same Name
and the Timestamp
is smaller than the row I am considering. (Say N=2 in this example)
Example output should look like:
Name | Timestamp | Value | AVG of feature |
---|---|---|---|
object 1 | 2021-11-01 10:00:00 | 1. | (1.1 + 1.3)/2 |
object 1 | 2021-11-01 11:00:00 | 1.5 | (1.1 + 1.3)/2 |
object 2 | 2021-11-01 10:30:00 | 1.7 | 1.3 |
Ideally I would be able to do even weighted averages depending on time differences. For example
Name | Timestamp | Value | AVG of feature |
---|---|---|---|
object 1 | 2021-11-01 10:00:00 | 1. | (60min * 1.1 + 30min * 1.3)/(2 * 90min) |
object 1 | 2021-11-01 11:00:00 | 1.5 | (120min * 1.1 + 90min * 1.3)/(2 * 210min) |
object 2 | 2021-11-01 10:30:00 | 1.7 | 1.3 |
IMPORTANT: My issue is that doing an DF1.apply
is taking really long as I have big dataframes (DF1 is around twice as big as DF2). I believe the most important bottleneck is on how to find the biggest timestamp in DF2 that is smaller than the current row in DF1
Upvotes: 0
Views: 153
Reputation: 260890
You need to use pandas.merge_asof
to align the timestamps:
df1.join(pd
.merge_asof(df2.sort_values(by='Timestamp'),
df1.sort_values(by='Timestamp')
.reset_index()
.drop(columns='Value')
.rename(columns={'Timestamp': 'TS'}),
by='Name', left_on='Timestamp', right_on='TS',
direction='forward')
.assign(weight=lambda d: d['TS'].sub(d['Timestamp']).dt.total_seconds(),
feature=lambda d: d['feature'].mul(d['weight'])
)
.groupby('index').apply(lambda g: g['feature'].sum()/g['weight'].sum()/len(g))
.rename('AVG of (feature)')
)
output:
Name Timestamp Value AVG of (feature)
0.0 object 1 2021-11-01 10:00:00 1.0 0.583333
1.0 object 1 2021-11-01 11:00:00 1.5 NaN
2.0 object 2 2021-11-01 10:30:00 1.7 1.300000
NB. if you want to propagate the previous value of the AVG you can use ffill
per group
Upvotes: 1