Tahnee

Reputation: 77

Group by repeating column values of a pandas df to sum/average other columns

I have a df looking like this:

    Activity    Count   angle_1frame_abs
87  11.2454 4   1.9863239600400613
88  14.3124 4   1.633204419481332
89  7.15621 4   1.7235925045363631
90  1.02232 4   1.4205234792290875
172 6.13389 1   1.9096280055821166
187 1.02232 3   1.7052938965382456
188 7.15621 3   1.708833899378485
189 2.04463 3   1.2728507985832682
233 4.08926 1   1.554572584797844
265 4.08926 2   1.512615236089327
266 5.11157 2   1.4850900583919704
281 6.13389 1   1.162132169753371
305 3.06694 2   2.3605660470439824
306 3.06694 2   1.5685525010916657
385 5.11157 2   1.6579646804948973
386 2.04463 2   2.121520877298791
407 5.11157 4   1.1528498264361269
408 12.2678 4   1.7986876725933032
409 9.20082 4   1.5502484587771188
410 2.04463 4   1.6302871732665316 
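
For reproducibility, the frame can be rebuilt like this (a sketch; the angle values are rounded to six decimals here):

import pandas as pd

df = pd.DataFrame(
    {"Activity": [11.2454, 14.3124, 7.15621, 1.02232, 6.13389, 1.02232,
                  7.15621, 2.04463, 4.08926, 4.08926, 5.11157, 6.13389,
                  3.06694, 3.06694, 5.11157, 2.04463, 5.11157, 12.2678,
                  9.20082, 2.04463],
     "Count": [4, 4, 4, 4, 1, 3, 3, 3, 1, 2, 2, 1,
               2, 2, 2, 2, 4, 4, 4, 4],
     "angle_1frame_abs": [1.986324, 1.633204, 1.723593, 1.420523, 1.909628,
                          1.705294, 1.708834, 1.272851, 1.554573, 1.512615,
                          1.485090, 1.162132, 2.360566, 1.568553, 1.657965,
                          2.121521, 1.152850, 1.798688, 1.550248, 1.630287]},
    index=[87, 88, 89, 90, 172, 187, 188, 189, 233, 265, 266, 281,
           305, 306, 385, 386, 407, 408, 409, 410],
)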

The column "Count" has stretches of repeating values that I would like to use to group the df.

In the above example I would have 8 groups.

I am struggling to keep separate stretches of the same value as separate groups.

My final aim would be to aggregate the other columns based on these groups, averaging one (Activity) and summing the other (angle_1frame_abs).

My desired output would be something like this (note: the Activity and angle_1frame_abs values are made up):

Activity Count angle_1frame_abs
9   4   1.7
6   1   1.9 
4   3   1.7
4   1   1.5
4   2   1.4
6   1   1.1
4   2   1.5
8   4   1.5

I have been trying something like this, but it only gives me one group per unique value in the "Count" column (4 groups instead of 8):

df.groupby(["Count"]).angle_1frame_abs.sum().reset_index()

Upvotes: 1

Views: 395

Answers (2)

Mario

Reputation: 573

You could specify every column you want to keep in your aggregation function:

df.groupby(["Count"]).agg({k: ['sum', 'mean'] for k in df.columns}).reset_index()

This calculates the sum and mean of every column per group, so it will only work with a DataFrame whose remaining columns are all numeric.
Here is a version that works for stretches of repeating values:

# First, create a helper column "groups" that is incremented every
# time the value in the Count column changes (note: the index here
# starts at 87, so we cannot rely on i == 0 to detect the first row):
df['groups'] = None
group = 0
valueBefore = None
for i, row in df.iterrows():
    # start a new group whenever Count differs from the previous row
    if valueBefore is not None and row['Count'] != valueBefore:
        group += 1
    df.loc[i, 'groups'] = group
    valueBefore = row['Count']
# Then you can group by it and aggregate all other columns, e.g.:
df.groupby(["groups"]).agg({k: ['sum', 'mean'] for k in df.columns if k != 'groups'}).reset_index()

Upvotes: 0

user7864386

Reputation:

You can create the groups using diff + ne + cumsum. The idea is to check whether each value is a consecutive repeat of the previous one and mark where the stretches break; cumsum then assigns a distinct number to each group:

groups = df['Count'].diff().ne(0).cumsum()

From the given input, this creates:

87     1
88     1
89     1
90     1
172    2
187    3
188    3
189    3
233    4
265    5
266    5
281    6
305    7
306    7
385    7
386    7
407    8
408    8
409    8
410    8

Note that this assigns equal "Count" values to different groups as long as they are not consecutive.

Then you can group df by these new groups. For example, the average of "Activity" for each group is:

out = df.groupby(groups)['Activity'].mean()

Count
1    8.434083
2    6.133890
3    3.407720
4    4.089260
5    4.600415
6    6.133890
7    3.322520
8    7.156205
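
Building on the same groups, one possible sketch of the full desired output (mean Activity, summed angle_1frame_abs, keeping the Count of each stretch) via named aggregation:

out = df.groupby(groups).agg(
    Activity=('Activity', 'mean'),
    Count=('Count', 'first'),       # the value shared by the stretch
    angle_1frame_abs=('angle_1frame_abs', 'sum'),
).reset_index(drop=True)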

Also, in case the original grouping column is not numeric (or a mix of dtypes), so that diff does not apply, an alternative would be to use shift (thanks @StevenS):

df['Count'].ne(df['Count'].shift()).fillna(0).cumsum().astype(int)

Upvotes: 3
