bismo
bismo

Reputation: 1439

Pandas groupby cumcount - one cumulative count rather than a cumulative count for each unique value

Let's say I have a df

pd.DataFrame(
    {'name':['pam','pam','bob','bob','pam','bob','pam','bob'],
     'game_id':[0,0,1,1,0,2,1,2]
     }
    )


   name game_id
0   pam 0
1   pam 0
2   bob 1
3   bob 1
4   pam 0
5   bob 2
6   pam 1
7   bob 2

I want to calculate how many games bob and amy have appeared in cumulatively. However, when I use .groupby() and .cumcount()+1, I get something different. I get a cumulative count within each game_id:

df['games'] = df.groupby(['name','game_id']).cumcount()+1

    name game_id games
0   pam 0   1
1   pam 0   2
2   bob 1   1
3   bob 1   2
4   pam 0   3
5   bob 2   1
6   pam 1   1
7   bob 2   2

When what I really want is a one total cumulative count rather than a cumulative count for each unique game_id. Here's an example of my desired output:

    name game_id games
0   pam 0   1
1   pam 0   1
2   bob 1   1
3   bob 1   1
4   pam 0   1
5   bob 2   2
6   pam 1   2
7   bob 2   2

Note, in my actual dataset game_id is a random sequence of numbers.

Upvotes: 4

Views: 7864

Answers (2)

StevenS
StevenS

Reputation: 682

One line alternative.

The two essential steps performed are:

  1. Determine unique rows by taking the inverse of df.duplicated
  2. groupby.cumsum on step #1 column to get the cumulative unique count
df['games'] = df.assign(temp=~df.duplicated(subset=['name','game_id'])).groupby('name')['temp'].cumsum()

  name  game_id  games
0  pam        0      1
1  pam        0      1
2  bob        1      1
3  bob        1      1
4  pam        0      1
5  bob        2      2
6  pam        1      2
7  bob        2      2

Upvotes: 3

wwnde
wwnde

Reputation: 26676

Lets try sort df, check consecutive difference, create new group by cumsum and then resort the df

new_df=df.sort_values(by=['name','game_id'])
new_df=new_df.assign(rank=new_df['game_id']!=new_df['game_id'].shift())
new_df=new_df.assign(rank=new_df.groupby('name')['rank'].cumsum()).sort_index()
print(new_df)



 name  game_id  rank
0  pam        0     1
1  pam        0     1
2  bob        1     1
3  bob        1     1
4  pam        0     1
5  bob        2     2
6  pam        1     2
7  bob        2     2

Upvotes: 2

Related Questions