Reputation: 8768
Let's say we have the following dataframe. If we wanted to find the count of consecutive 1's, you could use the below.
col
0 0
1 1
2 1
3 1
4 0
5 0
6 1
7 1
8 0
9 1
10 1
11 1
12 1
13 0
14 1
15 1
df['col'].groupby(df['col'].diff().ne(0).cumsum()).cumsum()
But the problem I see is when you need to use groupby
with and id field. If we added an id
field to the dataframe (below), it makes it more complicated. We can no longer use the solution above.
id col
0 B 0
1 B 1
2 B 1
3 B 1
4 A 0
5 A 0
6 B 1
7 B 1
8 B 0
9 B 1
10 B 1
11 A 1
12 A 1
13 A 0
14 A 1
15 A 1
When presented with this issue, ive seen the case for making a helper series to use in the groupby like this:
s = df['col'].eq(0).groupby(df['id']).cumsum()
df['col'].groupby([df['id'],s]).cumsum()
Which works, but the problem is that the first group contains the first row, which does not fit the criteria. This usually isn't a problem, but it is if we wanted to find the count. Replacing cumsum()
at the end of the last groupby()
with .transform('count')
would actually give us 6
instead of 5
for the count of consecutive 1's in the first B group.
The only solution I can come up with for this problem is the following code:
df['col'].groupby([df['id'],df.groupby('id')['col'].transform(lambda x: x.diff().ne(0).astype(int).cumsum())]).transform('count')
Expected output:
0 1
1 5
2 5
3 5
4 2
5 2
6 5
7 5
8 1
9 2
10 2
11 2
12 2
13 1
14 2
15 2
This works, but uses transform()
twice, which I heard isn't the fastest. It is the only solution I can think of that uses diff().ne(0)
to get the "real" groups.
Index 1,2,3,6 and 7 are all id B, with the same value in the 'col'
column, so the count would not be reset, so they would all be apart of the same group.
Can this be done without using multiple .transform()
?
Upvotes: 2
Views: 424
Reputation: 153460
After looking at @TrentonMcKinney solution, I came up with:
df = df.sort_values(['id'])
grp =(df[['id','col']] != df[['id','col']].shift()).any(axis=1).cumsum()
df['count'] = df.groupby(grp)['id'].transform('count')
df.sort_index()
Output:
id col count
0 B 0 1
1 B 1 5
2 B 1 5
3 B 1 5
4 A 0 2
5 A 0 2
6 B 1 5
7 B 1 5
8 B 0 1
9 B 1 2
10 B 1 2
11 A 1 2
12 A 1 2
13 A 0 1
14 A 1 2
15 A 1 2
IIUC, do you want?
grp = (df[['id', 'col']] != df[['id', 'col']].shift()).any(axis = 1).cumsum()
df['count'] = df.groupby(grp)['id'].transform('count')
df
Output:
id col count
0 B 0 1
1 B 1 3
2 B 1 3
3 B 1 3
4 A 0 2
5 A 0 2
6 B 1 2
7 B 1 2
8 B 0 1
9 B 1 2
10 B 1 2
11 A 1 2
12 A 1 2
13 A 0 1
14 A 1 2
15 A 1 2
Upvotes: 2
Reputation: 62373
.transform()
, and relies upon ordering the index, to get the correct counts.
cum_counts['cum_counts']
to get the exact desired output, without the other column.import pandas as pd
# test data as shown in OP
df = pd.DataFrame({'id': ['B', 'B', 'B', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A'], 'col': [0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]})
# reset the index, then set the index and sort
df = df.reset_index().set_index(['index', 'id']).sort_index(level=1)
col
index id
4 A 0
5 A 0
11 A 1
12 A 1
13 A 0
14 A 1
15 A 1
0 B 0
1 B 1
2 B 1
3 B 1
6 B 1
7 B 1
8 B 0
9 B 1
10 B 1
# get the cumulative sum
g = df.col.ne(df.col.shift()).cumsum()
# use g to groupby and use only 1 transform to get the counts
cum_counts = df['col'].groupby(g).transform('count').reset_index(level=1, name='cum_counts').sort_index()
id cum_counts
index
0 B 1
1 B 5
2 B 5
3 B 5
4 A 2
5 A 2
6 B 5
7 B 5
8 B 1
9 B 2
10 B 2
11 A 2
12 A 2
13 A 1
14 A 2
15 A 2
Upvotes: 2