Reputation: 8033
I have a dataframe as below (This is an update of this question here)
id val type
aa 0 C
aa 1 T
aa 2 T
aa 3 T
aa 0 M
aa 1 M
aa 2 C
aa 3 M
bbb 0 C
bbb 1 T
bbb 2 T
bbb 3 T
bbb 0 M
bbb 1 M
bbb 2 C
bbb 3 T
cccccc 0 C
cccccc 1 T
cccccc 2 T
cccccc 3 T
cccccc 0 M
cccccc 1 M
cccccc 0 C
cccccc 1 C
dddddddd 3 G
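For reproducibility, the frame can be rebuilt with something like this (values copied from the table above):
import pandas as pd

df = pd.DataFrame({
    'id':   ['aa'] * 8 + ['bbb'] * 8 + ['cccccc'] * 8 + ['dddddddd'],
    'val':  [0, 1, 2, 3, 0, 1, 2, 3,
             0, 1, 2, 3, 0, 1, 2, 3,
             0, 1, 2, 3, 0, 1, 0, 1,
             3],
    'type': ['C', 'T', 'T', 'T', 'M', 'M', 'C', 'M',
             'C', 'T', 'T', 'T', 'M', 'M', 'C', 'T',
             'C', 'T', 'T', 'T', 'M', 'M', 'C', 'C',
             'G'],
})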
I want to group by "id" and then sum & count the rows in column "val", but only the rows whose "type" matches the last value of column "type" in each group should be summed. Also, if the group has more than one such row, the last row itself should not be summed or counted. If there is only one matching row, then that last row should be summed & counted.
For example, the last row of group 'aa' has "type" M, so only the rows with "type" M in that group should be summed & counted. However, since there is more than one row with M, only the rows other than the last one should be summed & counted. So values 0 & 1 are summed and the count is 2.
In another scenario, group 'dddddddd' has only one row, so the sum should be 3 & the count should be 1.
The expected output for the df above is as below. Column "type" in the output is not mandatory; it can be left out if it takes more time to get it. I am showing it here just to give clarity on what I want to achieve.
id val count type
aa 1 2 M
bbb 6 3 T
cccccc 0 2 C
dddddddd 3 1 G
Upvotes: 0
Views: 379
Reputation: 1234
Find the last value, and subtract it from the sum at the end:
# last row of each group; rename val so it survives the merge
last_type = df.groupby("id").tail(1).rename(columns={'val': 'last_val'})

# keep only the rows whose type matches the group's last type, then aggregate
res = df.merge(last_type, on=["id", "type"], how="inner").groupby(["id", "type"]).agg(
    val=('val', 'sum'),
    count=('val', 'count'),
    last_val=('last_val', 'first')).reset_index()

# where a group has more than one matching row, remove the last row's contribution
multiple = res['count'] > 1
res['val'] -= multiple * res['last_val']
res['count'] -= multiple
res.drop(columns='last_val')
Output:
id type val count
0 aa M 1 2
1 bbb T 6 3
2 cccccc C 0 2
3 dddddddd G 3 1
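One small note on the last line: drop returns a new frame, so if you want to keep working with the cleaned-up result rather than just display it, assign it back, e.g.:
res = res.drop(columns='last_val')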
Upvotes: 2
Reputation: 862781
The idea is to keep only the rows whose "type" matches the group's last "type", aggregate them, then take the last row per group with DataFrame.drop_duplicates and subtract it where the count is not 1:
# keep only rows whose type equals the group's last type, then aggregate per id
df1 = (df[df['type'].eq(df.groupby('id')['type'].transform('last'))]
         .groupby('id').agg(val=('val', 'sum'),
                            count=('val', 'size'),
                            type=('type', 'last')))
print (df1)
val count type
id
aa 4 3 M
bbb 9 4 T
cccccc 1 3 C
dddddddd 3 1 G
import numpy as np

# last row of each group as a val Series indexed by id
s = df.drop_duplicates('id', keep='last').set_index('id')['val']
# subtract it (and 1 from the count) where the group has more than one matching row
m = df1['count'] != 1
df1['val'] -= np.where(m, s, 0)
df1['count'] -= np.where(m, 1, 0)
print (df1)
val count type
id
aa 1 2 M
bbb 6 3 T
cccccc 0 2 C
dddddddd 3 1 G
Another solution:
cols = ['val', 'count']
# last row of each group with a helper count column of 1
df2 = (df.drop_duplicates('id', keep='last')
         .set_index('id')
         .assign(count=1)[cols])
# subtract it only where the count is not 1
df1[cols] = df1[cols].sub(df2.where(df1['count'] != 1, 0))
print (df1)
val count type
id
aa 1 2 M
bbb 6 3 T
cccccc 0 2 C
dddddddd 3 1 G
Detail:
print (df2)
val count
id
aa 3 1
bbb 3 1
cccccc 1 1
dddddddd 3 1
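If id should be a regular column rather than the index, to match the layout of the expected output, a reset_index at the end should be enough:
df1 = df1.reset_index()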
Upvotes: 2