moys

Reputation: 8033

Sum & count of a column based on the content of the last value in each group excluding the last row

I have a dataframe as below (This is an update of this question here)

id       val    type
aa         0    C
aa         1    T
aa         2    T
aa         3    T
aa         0    M
aa         1    M
aa         2    C
aa         3    M
bbb        0    C
bbb        1    T
bbb        2    T
bbb        3    T
bbb        0    M
bbb        1    M
bbb        2    C
bbb        3    T
cccccc     0    C
cccccc     1    T
cccccc     2    T
cccccc     3    T
cccccc     0    M
cccccc     1    M
cccccc     0    C
cccccc     1    C
dddddddd   3    G
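
For reproducibility, the dataframe above can be built as below (plain int/str dtypes are assumed from the printout):

import pandas as pd

# rebuild the sample data shown above
df = pd.DataFrame({
    'id':   ['aa'] * 8 + ['bbb'] * 8 + ['cccccc'] * 8 + ['dddddddd'],
    'val':  [0, 1, 2, 3, 0, 1, 2, 3,
             0, 1, 2, 3, 0, 1, 2, 3,
             0, 1, 2, 3, 0, 1, 0, 1,
             3],
    'type': list('CTTTMMCM') + list('CTTTMMCT') + list('CTTTMMCC') + ['G'],
})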

I want to do a groupby on "id" and then sum & count the rows in column "val"; however, only the rows whose "type" matches the last value of column "type" in each group should be summed. Also, if there is more than one such row, the last row itself should not be summed or counted. If there is only one row with that last value, then that single row should be summed & counted.

For example, the last row of group 'aa' has "type" M, so only the rows with "type" M in that group should be summed & counted. However, since there is more than one row with M, only the rows other than the last one should be summed & counted. So values 0 & 1 need to be summed and the count is 2.

In another scenario, group 'dddddddd' has only one row, so the sum should be 3 & the count should be 1.

The expected output for the df above is as below. Column "type" in the output is not mandatory; it can be left out if it takes more time to get it. I am showing it here just to give clarity on what I want to achieve.

id        val  count   type
aa          1  2       M
bbb         6  3       T
cccccc      0  2       C
dddddddd    3  1       G

Upvotes: 0

Views: 379

Answers (2)

9mat

Reputation: 1234

Find the last value per group, and subtract it from the sum at the end:

# Last row per id: its "type" tells us which rows to aggregate,
# and its "val" may have to be subtracted again at the end
last_type = df.groupby("id").tail(1).rename(columns={'val': 'last_val'})

# Keep only rows whose type matches the last type, then aggregate per group
res = df.merge(last_type, on=["id", "type"], how="inner").groupby(["id", "type"]).agg(
        val=('val', 'sum'),
        count=('val', 'count'),
        last_val=('last_val', 'first')).reset_index()

# Groups with more than one matching row: remove the last row's contribution
multiple = res['count'] > 1
res['val'] -= multiple * res['last_val']
res['count'] -= multiple

res.drop(columns='last_val')

Output:
         id type  val  count
0        aa    M    1      2
1       bbb    T    6      3
2    cccccc    C    0      2
3  dddddddd    G    3      1
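
As a quick sanity check (a sketch added on top of the answer, with the expected frame typed out from the question), the result can be compared directly:

import pandas as pd

# expected output, copied from the question
expected = pd.DataFrame({
    'id':    ['aa', 'bbb', 'cccccc', 'dddddddd'],
    'type':  ['M', 'T', 'C', 'G'],
    'val':   [1, 6, 0, 3],
    'count': [2, 3, 2, 1],
})

out = res.drop(columns='last_val')
pd.testing.assert_frame_equal(
    out[['id', 'type', 'val', 'count']].reset_index(drop=True),
    expected,
    check_dtype=False,  # ignore integer width differences
)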

Upvotes: 2

jezrael

Reputation: 862781

The idea is to filter the rows whose "type" matches the last "type" per group and aggregate; then take the last row per group with DataFrame.drop_duplicates and subtract it where the count is not 1:

# Keep rows whose type equals the group's last type, then aggregate per id
df1 = (df[df['type'].eq(df.groupby('id')['type'].transform('last'))]
           .groupby('id')
           .agg(val=('val', 'sum'),
                count=('val', 'size'),
                type=('type', 'last')))
print (df1)
          val  count type
id                       
aa          4      3    M
bbb         9      4    T
cccccc      1      3    C
dddddddd    3      1    G

import numpy as np

# last value per id; subtract it (and 1 from the count) only where count != 1
s = df.drop_duplicates('id', keep='last').set_index('id')['val']
m = df1['count'] != 1

df1['val'] -= np.where(m, s, 0)
df1['count'] -= np.where(m, 1, 0)

print (df1)
          val  count type
id                       
aa          1      2    M
bbb         6      3    T
cccccc      0      2    C
dddddddd    3      1    G
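
If the output should match the question's layout with id as a regular column, a small follow-up (not part of the original answer) is a reset_index:

# turn the id index back into a column, ordered as in the question
out = df1.reset_index()[['id', 'val', 'count', 'type']]
print (out)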

Another solution:

# last row per id, with a helper count of 1
cols = ['val', 'count']
df2 = (df.drop_duplicates('id', keep='last')
         .set_index('id')
         .assign(count=1)[cols])

# subtract the last row only where the group has more than one matching row
df1[cols] = df1[cols].sub(df2.where(df1['count'] != 1, 0))
print (df1)
          val  count type
id                       
aa          1      2    M
bbb         6      3    T
cccccc      0      2    C
dddddddd    3      1    G

Detail:

print (df2)
          val  count
id                  
aa          3      1
bbb         3      1
cccccc      1      1
dddddddd    3      1

Upvotes: 2
