python noob
python noob

Reputation: 87

Pandas: Create column with rolling sum of previous n rows of another column for within the same id/group

Sample dataset:

    id  fruit
0   7   NaN
1   7   apple
2   7   NaN
3   7   mango
4   7   apple
5   7   potato
6   3   berry
7   3   olive
8   3   olive
9   3   grape
10  3   NaN
11  3   mango
12  3   potato

In fruit column value of NaN and potato is 0. All other strings value is 1. I want to generate a new column sum_last_3 where each row calculates the sum of previous 3 rows (inclusive) of fruit column. When a new id appears, it should calculate from the beginning.

Output I want:

   id   fruit sum_last3
0   7   NaN     0
1   7   apple   1
2   7   NaN     1
3   7   mango   2
4   7   apple   2
5   7   potato  2
6   3   berry   1
7   3   olive   2
8   3   olive   3
9   3   grape   3
10  3   NaN     2
11  3   mango   2
12  3   potato  1

My Code:

df['sum_last5'] = (df['fruit'].ne('potato') & df['fruit'].notna())
.groupby('id',sort=False, as_index=False)['fruit']
.rolling(min_periods=1, window=3).sum().astype(int).values

Upvotes: 3

Views: 321

Answers (1)

SeaBean
SeaBean

Reputation: 23237

You can modify your codes slightly, as follows:

df['sum_last3'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
                      .groupby(df['id'],sort=False)
                      .rolling(min_periods=1, window=3).sum().astype(int)
                      .droplevel(0)
                  )

or use .values as in your codes:

df['sum_last3'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
                      .groupby(df['id'],sort=False)
                      .rolling(min_periods=1, window=3).sum().astype(int)
                      .values
                  )

Your codes are close, just need to change id to df['id'] in the .groupby() call (since the main subject for calling .groupby() is now a boolean series rather than df itself, so .groupby() cannot recognize the id column by the column label 'id' alone and need also the dataframe name to fully qualify/identify the column).

Also remove as_index=False since this parameter is for dataframe rather than (boolean) series here.

Result:

print(df)

    id   fruit  sum_last3
0    7     NaN          0
1    7   apple          1
2    7     NaN          1
3    7   mango          2
4    7   apple          2
5    7  potato          2
6    3   berry          1
7    3   olive          2
8    3   olive          3
9    3   grape          3
10   3     NaN          2
11   3   mango          2
12   3  potato          1

Upvotes: 2

Related Questions