How to get the cumulative count based on two columns

Question

Let's say we have the following dataframe. If we wanted a running count of consecutive 1's, we could use the code below.

    col
0   0
1   1
2   1
3   1
4   0
5   0
6   1
7   1
8   0
9   1
10  1
11  1
12  1
13  0
14  1
15  1


df['col'].groupby(df['col'].diff().ne(0).cumsum()).cumsum()
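As a minimal, self-contained sketch of the one-liner above: the frame is rebuilt from the values shown, and the names `run_id` and `running` are only illustrative.

```python
import pandas as pd

# Same values as the 'col' column shown above
df = pd.DataFrame({'col': [0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]})

# A new run starts wherever the value differs from the previous row.
# The first row's diff is NaN, which also counts as "not equal to 0".
run_id = df['col'].diff().ne(0).cumsum()

# Cumulative sum inside each run: 1s count up, 0s stay at 0
running = df['col'].groupby(run_id).cumsum()
print(running.tolist())
# [0, 1, 2, 3, 0, 0, 1, 2, 0, 1, 2, 3, 4, 0, 1, 2]
```

Splitting the run labels out into `run_id` makes it easier to inspect where each run starts before aggregating.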

But the problem I see is when you need to use groupby with an id field. If we add an id field to the dataframe (below), it becomes more complicated, and we can no longer use the solution above.

When presented with this issue, I've seen the case made for building a helper series to use in the groupby, like this:

s = df['col'].eq(0).groupby(df['id']).cumsum()
df['col'].groupby([df['id'],s]).cumsum()

This works for the cumulative sum, but the first group also contains the first row (a 0), which does not fit the criteria. That usually isn't a problem, but it is if we want the count. Replacing cumsum() at the end of the last groupby() with .transform('count') would actually give us 6 instead of 5 for the count of consecutive 1's in the first B group.
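The off-by-one can be reproduced with a short sketch. It assumes the id values from the test data in the accepted answer below; `astype(int)` is added only to make the boolean sum explicit, and `counts` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'id':  list('BBBBAABBBBBAAAAA'),
    'col': [0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1],
})

# Helper series: a per-id counter that increases at every 0
s = df['col'].eq(0).astype(int).groupby(df['id']).cumsum()

# Counting per (id, helper) group includes the leading 0 of each group
counts = df['col'].groupby([df['id'], s]).transform('count')
print(counts.loc[[0, 1, 2, 3, 6, 7]].tolist())
# [6, 6, 6, 6, 6, 6]
```

Row 0 (the 0 that opens the group) is counted together with the five 1's at rows 1, 2, 3, 6 and 7, which is where the 6 comes from.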

The only solution I can come up with for this problem is the following code:

df['col'].groupby([df['id'],df.groupby('id')['col'].transform(lambda x: x.diff().ne(0).astype(int).cumsum())]).transform('count')

Expected output:

This works, but uses transform() twice, which I heard isn't the fastest. It is the only solution I can think of that uses diff().ne(0) to get the "real" groups.

Indexes 1, 2, 3, 6 and 7 all have id B and the same value in the 'col' column, so the count is not reset and they are all a part of the same group.

Can this be done without using multiple .transform()?

Trenton McKinney · Accepted Answer

The following code uses only one .transform() and relies on sorting the index to get the correct counts.
- The original index is kept, so the final result can be reindexed back to the original order.
- Use cum_counts['cum_counts'] to get the exact desired output, without the other column.

import pandas as pd

# test data as shown in OP
df = pd.DataFrame({'id': ['B', 'B', 'B', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A'], 'col': [0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]})

# keep the original row order as an 'index' level, then sort by id so each id's rows are contiguous
df = df.reset_index().set_index(['index', 'id']).sort_index(level=1)

          col
index id     
4     A     0
5     A     0
11    A     1
12    A     1
13    A     0
14    A     1
15    A     1
0     B     0
1     B     1
2     B     1
3     B     1
6     B     1
7     B     1
8     B     0
9     B     1
10    B     1

# label each run of equal values: the label increases whenever col differs from the previous row
g = df.col.ne(df.col.shift()).cumsum()

# use g to groupby and use only 1 transform to get the counts
cum_counts = df['col'].groupby(g).transform('count').reset_index(level=1, name='cum_counts').sort_index()

      id  cum_counts
index               
0      B           1
1      B           5
2      B           5
3      B           5
4      A           2
5      A           2
6      B           5
7      B           5
8      B           1
9      B           2
10     B           2
11     A           2
12     A           2
13     A           1
14     A           2
15     A           2
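One caveat with the sort-and-shift approach: after sorting, the last row of one id sits directly above the first row of the next, so two runs with the same col value would be merged across that boundary. It does not happen with this data, because the last A row is 1 and the first B row is 0. A possible variant, offered only as a sketch and not as part of the answer above, shifts within each id instead, so no sorting is needed. The names `prev`, `run_id` and `counts` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id':  list('BBBBAABBBBBAAAAA'),
    'col': [0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1],
})

# Previous value of col within the same id (NaN on each id's first row)
prev = df.groupby('id')['col'].shift()

# Per-id run label: increases whenever col differs from that id's previous row
run_id = df['col'].ne(prev).astype(int).groupby(df['id']).cumsum()

# A single transform, grouped by id and run label, in the original row order
counts = df['col'].groupby([df['id'], run_id]).transform('count')
print(counts.tolist())
# [1, 5, 5, 5, 2, 2, 5, 5, 1, 2, 2, 2, 2, 1, 2, 2]
```

This still uses two groupby passes to build the labels, but both are built-in groupby methods (`shift` and `cumsum`) rather than a lambda passed to `.transform()`.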
