CTXR
CTXR

Reputation: 199

Pandas Rolling Conditional Function

I'm having some trouble to use .apply or .aggregate in pandas on a rolling basis (assuming of course that it is the right way to solve my problem). Let's assume I have a dataframe with two columns A and B. I would like to create a column C that will contain the rolling mean of B if A is having the value of 1. And more generally I would like to be able to apply a custom function on a rolling basis with some conditions involving several columns of the dataframe (e.g. rolling sum of column A when B > x and/or C = y etc.).

import pandas as pd
import numpy as np
df2 = pd.DataFrame({'A':[1,1,1,0,0,0,1,1,1],'B': [50,40,50,-20,20,10,10,-5,-2]}, index = np.arange(9))

Desired output would be (assuming a rolling window of 3):

df2 = pd.DataFrame({'A':[1,1,1,0,0,0,1,1,1],'B': [50,40,50,-20,20,10,10,-5,-2],\
 'C': [np.nan, np.nan, 46.67, 45, 50, np.nan, 10, 2.50, 1]}, index = np.arange(9))

I have tried to define a function mean_1 as follows:

def mean_1(x):
    return np.where(x['A'] == 1, np.mean(x['B']), np.nan)

df2['C'] = df2.rolling(3).apply(mean_1)

and got the error: 'Series' object has no attribute 'A' I guess it is due related to the raw = False in the documentation Thanks

Upvotes: 2

Views: 1024

Answers (3)

You can first mask the 'B' values where 'A' is not 1, then apply the rolling method:

mask_map = df2.A != 1
df2['C'] = df2.B.mask(mask_map).rolling(3, min_periods=1).mean().round(2)

Output:

   A   B      C
0  1  50  50.00
1  1  40  45.00
2  1  50  46.67
3  0 -20  45.00
4  0  20  50.00
5  0  10    NaN
6  1  10  10.00
7  1  -5   2.50
8  1  -2   1.00

Note that the first values are not NaN because we specified min_periods=1. It means that we're taking the mean regardless the quantity of missing values. So, if that's the case, and if you really want to set the first values as NaN, this can be done with:

df2.iloc[:n-1, df2.columns.get_loc('C')] = np.nan

where n is the window size (3 in this case). This will return the exact desired output.

Best!

Upvotes: 2

Georgina Skibinski
Georgina Skibinski

Reputation: 13407

You can vectorize your solution:

df2['C'] = df2['A'].eq(1).mul(df2['B']).rolling(3).sum()\
    .div(df2['A'].eq(1).rolling(3).sum())\
    .round(2)

If you are asking more general in terms of any function - my advise would be - always try to vectorize, generally avoid .apply(...)

Upvotes: 1

Nabil Daoud
Nabil Daoud

Reputation: 221

Here is a way to get close to what you're desired output.

df2['C'] = df2.apply(lambda row: np.where(row['A']==1, row['B'], np.nan), axis=1).rolling(3, min_periods=1).apply(np.nanmean)

The difference is that the above give a value for index 0 and 1.

Upvotes: 1

Related Questions