helloimgeorgia

Reputation: 361

Pandas groupby rolling mean, but only for the most recent row to save calculation time

I have a dataframe that is very large, in this format:

id     time     price
ABC    01:04     100
XYZ    01:04      50
QRS    01:04      25
ABC    01:03     100
XYZ    01:03      50
QRS    01:03      25
ABC    01:02     100
XYZ    01:02      50
QRS    01:02      25
ABC    01:01     100
XYZ    01:01      50
QRS    01:01      25
ABC    01:00     100
XYZ    01:00      50
QRS    01:00      25

In this case, 01:04 is the most recent timestamp, and I want a rolling mean of price for each ticker (id), going back 4 periods, computed only for that most recent row. I don't need the rolling means for 01:03, 01:02, or 01:01.

The code I have so far does what I need, except that it calculates the mean for every time period, not just the most recent one:

rmean = db.groupby('id').rolling(window=3).mean()

I only need the value for the most recent time period because my actual data is massive and computing every window takes too long.

Expected output:

id     time     price
ABC    01:04     100
XYZ    01:04      50
QRS    01:04      25

Upvotes: 0

Views: 807

Answers (1)

Amit Vikram Singh

Reputation: 2128

You can use groupby with agg. To compute the mean for only the most recent rows, sort by time in descending order, take head(3) of each group, and compute the mean of those rows.

Use:

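# Sort newest first so that head(3) picks the three most recent rows per id.
new_df = (df.sort_values(by='time', ascending=False)
            .groupby('id', as_index=False)
            .agg(
              time=('time', 'first'),  # most recent timestamp in each group
              price=('price', lambda x: x.head(3).values.mean())
            ))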

Prints:

>>> new_df
    id   time  price
0  ABC  01:04    100
1  QRS  01:04     25
2  XYZ  01:04     50
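
On very large data, the Python lambda inside agg can itself become slow. A possible variant of the same idea, offered only as a sketch, first keeps the newest rows per id with groupby(...).head(...) and then applies the built-in mean. The function name latest_window_mean and its window parameter are hypothetical, and the sample frame simply rebuilds the data from the question.

```python
import pandas as pd


def latest_window_mean(df, window=3):
    """Mean of the `window` most recent prices per id (hypothetical helper)."""
    # Newest rows first, so head(window) keeps the most recent rows of each id.
    ordered = df.sort_values("time", ascending=False)
    recent = ordered.groupby("id").head(window)
    # Aggregate only the retained rows, using built-in reducers.
    return recent.groupby("id", as_index=False).agg(
        time=("time", "first"),   # most recent timestamp per id
        price=("price", "mean"),  # mean over the retained rows
    )


# Sample data rebuilt from the question.
times = ["01:04", "01:03", "01:02", "01:01", "01:00"]
prices = {"ABC": 100, "XYZ": 50, "QRS": 25}
df = pd.DataFrame(
    [(i, t, p) for t in times for i, p in prices.items()],
    columns=["id", "time", "price"],
)

result = latest_window_mean(df, window=3)
print(result)
```

Here the price column comes back as floats, and changing window (for example to 4, as mentioned in the question) only changes how many rows head keeps per id.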

Upvotes: 2
