nika
nika

Reputation: 3

pandas - add a column of the mean of the last 3 elements in groupby

I have a dataframe of several columns, which I sorted, grouped by index and calculated the difference between each row and the next one in the group. Next I want to add a column of the means of the last 3 differences. For example:

index  A  B    A_diff  B_diff    A_diff_last3mean    B_diff_last3mean 
1111   1  2      0       0               NaN              NaN
1111   1  2      0       0               NaN              NaN                 
1111   2  4      1       2               0.33             0.67
1111   4  6      2       2                1               1.33
2222   5  7     NaN     NaN              NaN              NaN #index changed
2222   2  8     -3       1               NaN              NaN 

I managed to create such columns using

df=df.join(df.groupby(['index'],sort=False,as_index=False).diff(),rsuffix='_diff')
y=df.groupby(['index'],sort=False,as_index=False).nth([-1,-2,-3])
z=y.groupby(['index'],sort=False,as_index=False).mean()

but that creates an aggregated dataframe, and I need the values to be merged in the original one. I tried with the .transform() function and did not succeed much. Would really appreciate your help.

Upvotes: 0

Views: 77

Answers (1)

Alicia Garcia-Raboso
Alicia Garcia-Raboso

Reputation: 13913

import io
import pandas as pd

data = io.StringIO('''\
group  A  B
1111   1  2
1111   1  2
1111   2  4
1111   4  6
2222   5  7
2222   2  8
''')
df = pd.read_csv(data, delim_whitespace=True)

diff = (df.groupby('group')
          .diff()
          .fillna(0)
          .add_suffix('_diff'))
df = df.join(diff)

last3mean = (df.groupby('group')[diff.columns]
               .rolling(3).mean()  
               .reset_index(drop=True)
               .add_suffix('_last3mean'))
df = df.join(last3mean)
print(df)

Output:

   group  A  B  A_diff  B_diff  A_diff_last3mean  B_diff_last3mean
0   1111  1  2     0.0     0.0               NaN               NaN
1   1111  1  2     0.0     0.0               NaN               NaN
2   1111  2  4     1.0     2.0          0.333333          0.666667
3   1111  4  6     2.0     2.0          1.000000          1.333333
4   2222  5  7     0.0     0.0               NaN               NaN
5   2222  2  8    -3.0     1.0               NaN               NaN

Notes:

  • Although index is a perfectly valid column name, pandas DataFrames have indices too. To avoid confusion, I have renamed that column to group.

  • In your desired output, you seem to have filled the NaNs in columns A_diff and B_diff for the group 1111 but not for the group 2222. The first line in your code snippet does not perform such filling. I have filled them all — .fillna(0) in the definition of diff, but you can drop that if you want.

Upvotes: 1

Related Questions