Tahnee

Reputation: 77

Group by repeating column values of a pandas df to sum/average other columns

I have a df looking like this:

    Activity    Count   angle_1frame_abs
87  11.2454 4   1.9863239600400613
88  14.3124 4   1.633204419481332
89  7.15621 4   1.7235925045363631
90  1.02232 4   1.4205234792290875
172 6.13389 1   1.9096280055821166
187 1.02232 3   1.7052938965382456
188 7.15621 3   1.708833899378485
189 2.04463 3   1.2728507985832682
233 4.08926 1   1.554572584797844
265 4.08926 2   1.512615236089327
266 5.11157 2   1.4850900583919704
281 6.13389 1   1.162132169753371
305 3.06694 2   2.3605660470439824
306 3.06694 2   1.5685525010916657
385 5.11157 2   1.6579646804948973
386 2.04463 2   2.121520877298791
407 5.11157 4   1.1528498264361269
408 12.2678 4   1.7986876725933032
409 9.20082 4   1.5502484587771188
410 2.04463 4   1.6302871732665316 
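
For reproducibility, the frame can be rebuilt like this (a sketch; the angle values are rounded to six decimals here):

import pandas as pd

df = pd.DataFrame(
    {"Activity": [11.2454, 14.3124, 7.15621, 1.02232, 6.13389, 1.02232,
                  7.15621, 2.04463, 4.08926, 4.08926, 5.11157, 6.13389,
                  3.06694, 3.06694, 5.11157, 2.04463, 5.11157, 12.2678,
                  9.20082, 2.04463],
     "Count": [4, 4, 4, 4, 1, 3, 3, 3, 1, 2, 2, 1,
               2, 2, 2, 2, 4, 4, 4, 4],
     "angle_1frame_abs": [1.986324, 1.633204, 1.723593, 1.420523, 1.909628,
                          1.705294, 1.708834, 1.272851, 1.554573, 1.512615,
                          1.485090, 1.162132, 2.360566, 1.568553, 1.657965,
                          2.121521, 1.152850, 1.798688, 1.550248, 1.630287]},
    index=[87, 88, 89, 90, 172, 187, 188, 189, 233, 265, 266, 281,
           305, 306, 385, 386, 407, 408, 409, 410],
)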

The column "Count" has stretches of repeating values that I would like to use to group the df.

In the above example I would have 8 groups.

I am struggling to keep separate stretches of the same value as separate groups.

My final aim would be to aggregate the other columns based on these groups, averaging one (Activity) and summing the other (angle_1frame_abs).

My desired output would be something like this (note: the Activity and angle_1frame_abs values are made up):

Activity Count angle_1frame_abs
9   4   1.7
6   1   1.9 
4   3   1.7
4   1   1.5
4   2   1.4
6   1   1.1
4   2   1.5
8   4   1.5

I have been trying something like this, but it only gives me one group per unique value in the "Count" column (4 groups instead of 8):

df.groupby(["Count"]).angle_1frame_abs.sum().reset_index()

Upvotes: 1

Views: 395

Answers (2)

Mario

Reputation: 573

You could specify every column you want to keep in your aggregation function:

df.groupby(["Count"]).agg({k: ['sum', 'mean'] for k in df.columns}).reset_index()

This calculates the sum and mean of every column per group, so it will only work with a DataFrame whose remaining columns are all numeric.
Here is a version that works for stretches of repeating values:

# First, create a helper column "groups" that is incremented every
# time the value in the Count column changes (note: the index here
# starts at 87, so we cannot rely on i == 0 to detect the first row):
df['groups'] = None
group = 0
valueBefore = None
for i, row in df.iterrows():
    # start a new group whenever Count differs from the previous row
    if valueBefore is not None and row['Count'] != valueBefore:
        group += 1
    df.loc[i, 'groups'] = group
    valueBefore = row['Count']
# Then you can group by it and aggregate all other columns, e.g.:
df.groupby(["groups"]).agg({k: ['sum', 'mean'] for k in df.columns if k != 'groups'}).reset_index()

Upvotes: 0

user7864386

Reputation:

You can create the groups using diff + ne + cumsum. The idea is to check whether each value is a consecutive repeat of the previous one and mark where the stretches break; cumsum then assigns a distinct number to each group:

groups = df['Count'].diff().ne(0).cumsum()

From the given input, this creates:

87     1
88     1
89     1
90     1
172    2
187    3
188    3
189    3
233    4
265    5
266    5
281    6
305    7
306    7
385    7
386    7
407    8
408    8
409    8
410    8

Note that this assigns equal "Count" values to different groups as long as they are not consecutive.

Then you can group df by these new groups. For example, the average of "Activity" for each group is:

out = df.groupby(groups)['Activity'].mean()

Count
1    8.434083
2    6.133890
3    3.407720
4    4.089260
5    4.600415
6    6.133890
7    3.322520
8    7.156205
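
Building on the same groups, one possible sketch of the full desired output (mean Activity, summed angle_1frame_abs, keeping the Count of each stretch) via named aggregation:

out = df.groupby(groups).agg(
    Activity=('Activity', 'mean'),
    Count=('Count', 'first'),       # the value shared by the stretch
    angle_1frame_abs=('angle_1frame_abs', 'sum'),
).reset_index(drop=True)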

Also, in case the original grouping column is not numeric (or a mix of dtypes), so that diff does not apply, an alternative would be to use shift (thanks @StevenS):

df['Count'].ne(df['Count'].shift()).fillna(0).cumsum().astype(int)

Upvotes: 3
