Reputation: 77
I have a df looking like this:
Activity Count angle_1frame_abs
87 11.2454 4 1.9863239600400613
88 14.3124 4 1.633204419481332
89 7.15621 4 1.7235925045363631
90 1.02232 4 1.4205234792290875
172 6.13389 1 1.9096280055821166
187 1.02232 3 1.7052938965382456
188 7.15621 3 1.708833899378485
189 2.04463 3 1.2728507985832682
233 4.08926 1 1.554572584797844
265 4.08926 2 1.512615236089327
266 5.11157 2 1.4850900583919704
281 6.13389 1 1.162132169753371
305 3.06694 2 2.3605660470439824
306 3.06694 2 1.5685525010916657
385 5.11157 2 1.6579646804948973
386 2.04463 2 2.121520877298791
407 5.11157 4 1.1528498264361269
408 12.2678 4 1.7986876725933032
409 9.20082 4 1.5502484587771188
410 2.04463 4 1.6302871732665316
The column "Count" has stretches of repeating values that I would like to use to group the df.
In the above example I would have 8 groups.
I struggle keeping repeating values as separate groups.
My final aim would be to either average (Activity) or sum (angle_1frame_abs) the other columns based on these groups.
My desired output would be something like this (note: the activity and angle_1frames_abs values are made up):
Activity Count angle_1_frames_abs
9 4 1.7
6 1 1.9
4 3 1.7
4 1 1.5
4 2 1.4
6 1 1.1
4 2 1.5
8 4 1.5
I have been trying something like this but this only gave me groups based on unique values in the "count" column.
df.groupby(["Count"]).angle_1frame_abs.sum().reset_index()
Upvotes: 1
Views: 395
Reputation: 573
You could define every column you want to keep on your aggregation function:
df.groupby(["Count"]).agg({k: ['sum', 'mean'] for k in df.columns}).reset_index()
This here code calculates the sum and average for every column per group, hence this will work only with a dataframe that has only numeric columns.
Here is a version that will work for streches of groups:
# firstly, create a helper column called "groups" (this col will signify
# when a value is changed in the Count column):
df['groups'] = None
group = 0
for i, j in df.iterrows():
if i == 0:
df.loc[i, 'groups'] = group
valueBefore = df.loc[i, 'Count']
continue
if j.loc['Count'] != valueBefore:
group += 1
df.loc[i, 'groups'] = group
valueBefore = df.loc[i, 'Count']
# then you can use it to groupby all other columns, i.e.:
df.groupby(["groups"]).agg({k: ['sum', 'mean'] for k in df.columns}).reset_index()
Upvotes: 0
Reputation:
You can create groups using diff
+ ne
+ cumsum
. The idea is to check if the value appears consecutively and if not, check where it breaks. Then cumsum
is used to assign distinct numerical value to groups:
groups = df['Count'].diff().ne(0).cumsum()
From the given input, this creates:
87 1
88 1
89 1
90 1
172 2
187 3
188 3
189 3
233 4
265 5
266 5
281 6
305 7
306 7
385 7
386 7
407 8
408 8
409 8
410 8
Note that this assigns "Counts" to different groups even if they are the same as long as they are not consecutive.
Then you can use groupby
on df
by these new groups. For example, the average of "Activity" for each group is:
out = df.groupby(groups)['Activity'].mean()
Count
1 8.434083
2 6.133890
3 3.407720
4 4.089260
5 4.600415
6 6.133890
7 3.322520
8 7.156205
Also, in case the original grouping column is not numeric (or a mix of dtypes), an alternative would be to use shift (thanks @StevenS):
df['count'].ne(df['count'].shift()).fillna(0).cumsum().astype(int)
Upvotes: 3