Miguel Wang
Miguel Wang

Reputation: 385

Pandas get average from second dataframe of the last N rows within time interval

I have 2 DFs:

DF1

Name Timestamp Value
object 1 2021-11-01 10:00:00 1.
object 1 2021-11-01 11:00:00 1.5
object 2 2021-11-01 10:30:00 1.7

DF2

Name Timestamp feature
object 1 2021-11-01 8:00:00 0.9
object 1 2021-11-01 9:00:00 1.1
object 1 2021-11-01 9:30:00 1.3
object 1 2021-11-01 12:00:00 1.
object 2 2021-11-01 10:00:00 1.3
object 2 2021-11-01 11:30:00 1.9

Per row from DF1, I would like to get the rolling average of the last N rows from DF2 that have the same Name and the Timestamp is smaller than the row I am considering. (Say N=2 in this example)

Example output should look like:

Name Timestamp Value AVG of feature
object 1 2021-11-01 10:00:00 1. (1.1 + 1.3)/2
object 1 2021-11-01 11:00:00 1.5 (1.1 + 1.3)/2
object 2 2021-11-01 10:30:00 1.7 1.3

Ideally I would be able to do even weighted averages depending on time differences. For example

Name Timestamp Value AVG of feature
object 1 2021-11-01 10:00:00 1. (60min * 1.1 + 30min * 1.3)/(2 * 90min)
object 1 2021-11-01 11:00:00 1.5 (120min * 1.1 + 90min * 1.3)/(2 * 210min)
object 2 2021-11-01 10:30:00 1.7 1.3

IMPORTANT: My issue is that doing an DF1.apply is taking really long as I have big dataframes (DF1 is around twice as big as DF2). I believe the most important bottleneck is on how to find the biggest timestamp in DF2 that is smaller than the current row in DF1

Upvotes: 0

Views: 153

Answers (1)

mozway
mozway

Reputation: 260890

You need to use pandas.merge_asof to align the timestamps:

df1.join(pd
  .merge_asof(df2.sort_values(by='Timestamp'),
              df1.sort_values(by='Timestamp')
                 .reset_index()
                 .drop(columns='Value')
                 .rename(columns={'Timestamp': 'TS'}),
              by='Name', left_on='Timestamp', right_on='TS',
              direction='forward')
  .assign(weight=lambda d: d['TS'].sub(d['Timestamp']).dt.total_seconds(),
          feature=lambda d: d['feature'].mul(d['weight'])
         )
  .groupby('index').apply(lambda g: g['feature'].sum()/g['weight'].sum()/len(g))
  .rename('AVG of (feature)')
)

output:

         Name           Timestamp  Value  AVG of (feature)
0.0  object 1 2021-11-01 10:00:00    1.0          0.583333
1.0  object 1 2021-11-01 11:00:00    1.5               NaN
2.0  object 2 2021-11-01 10:30:00    1.7          1.300000

NB. if you want to propagate the previous value of the AVG you can use ffill per group

Upvotes: 1

Related Questions