Reputation: 199
I'm having some trouble to use .apply
or .aggregate
in pandas on a rolling basis (assuming of course that it is the right way to solve my problem). Let's assume I have a dataframe with two columns A and B. I would like to create a column C that will contain the rolling mean of B if A is having the value of 1. And more generally I would like to be able to apply a custom function on a rolling basis with some conditions involving several columns of the dataframe (e.g. rolling sum of column A when B > x and/or C = y etc.).
import pandas as pd
import numpy as np
df2 = pd.DataFrame({'A':[1,1,1,0,0,0,1,1,1],'B': [50,40,50,-20,20,10,10,-5,-2]}, index = np.arange(9))
Desired output would be (assuming a rolling window of 3):
df2 = pd.DataFrame({'A':[1,1,1,0,0,0,1,1,1],'B': [50,40,50,-20,20,10,10,-5,-2],\
'C': [np.nan, np.nan, 46.67, 45, 50, np.nan, 10, 2.50, 1]}, index = np.arange(9))
I have tried to define a function mean_1
as follows:
def mean_1(x):
return np.where(x['A'] == 1, np.mean(x['B']), np.nan)
df2['C'] = df2.rolling(3).apply(mean_1)
and got the error: 'Series' object has no attribute 'A'
I guess it is due related to the raw = False
in the documentation
Thanks
Upvotes: 2
Views: 1024
Reputation: 4929
You can first mask the 'B' values where 'A' is not 1, then apply the rolling method:
mask_map = df2.A != 1
df2['C'] = df2.B.mask(mask_map).rolling(3, min_periods=1).mean().round(2)
Output:
A B C
0 1 50 50.00
1 1 40 45.00
2 1 50 46.67
3 0 -20 45.00
4 0 20 50.00
5 0 10 NaN
6 1 10 10.00
7 1 -5 2.50
8 1 -2 1.00
Note that the first values are not NaN
because we specified min_periods=1
. It means that we're taking the mean regardless the quantity of missing values. So, if that's the case, and if you really want to set the first values as NaN
, this can be done with:
df2.iloc[:n-1, df2.columns.get_loc('C')] = np.nan
where n
is the window size (3 in this case). This will return the exact desired output.
Best!
Upvotes: 2
Reputation: 13407
You can vectorize your solution:
df2['C'] = df2['A'].eq(1).mul(df2['B']).rolling(3).sum()\
.div(df2['A'].eq(1).rolling(3).sum())\
.round(2)
If you are asking more general in terms of any function - my advise would be - always try to vectorize, generally avoid .apply(...)
Upvotes: 1
Reputation: 221
Here is a way to get close to what you're desired output.
df2['C'] = df2.apply(lambda row: np.where(row['A']==1, row['B'], np.nan), axis=1).rolling(3, min_periods=1).apply(np.nanmean)
The difference is that the above give a value for index 0 and 1.
Upvotes: 1