Reputation: 1439
Let's say I have a df
pd.DataFrame(
{'name':['pam','pam','bob','bob','pam','bob','pam','bob'],
'game_id':[0,0,1,1,0,2,1,2]
}
)
name game_id
0 pam 0
1 pam 0
2 bob 1
3 bob 1
4 pam 0
5 bob 2
6 pam 1
7 bob 2
I want to calculate how many games bob and amy have appeared in cumulatively. However, when I use .groupby()
and .cumcount()+1
, I get something different. I get a cumulative count within each game_id
:
df['games'] = df.groupby(['name','game_id']).cumcount()+1
name game_id games
0 pam 0 1
1 pam 0 2
2 bob 1 1
3 bob 1 2
4 pam 0 3
5 bob 2 1
6 pam 1 1
7 bob 2 2
When what I really want is a one total cumulative count rather than a cumulative count for each unique game_id
. Here's an example of my desired output:
name game_id games
0 pam 0 1
1 pam 0 1
2 bob 1 1
3 bob 1 1
4 pam 0 1
5 bob 2 2
6 pam 1 2
7 bob 2 2
Note, in my actual dataset game_id
is a random sequence of numbers.
Upvotes: 4
Views: 7864
Reputation: 682
One line alternative.
The two essential steps performed are:
df.duplicated
groupby.cumsum
on step #1 column to get the cumulative unique countdf['games'] = df.assign(temp=~df.duplicated(subset=['name','game_id'])).groupby('name')['temp'].cumsum()
name game_id games
0 pam 0 1
1 pam 0 1
2 bob 1 1
3 bob 1 1
4 pam 0 1
5 bob 2 2
6 pam 1 2
7 bob 2 2
Upvotes: 3
Reputation: 26676
Lets try sort df, check consecutive difference, create new group by cumsum and then resort the df
new_df=df.sort_values(by=['name','game_id'])
new_df=new_df.assign(rank=new_df['game_id']!=new_df['game_id'].shift())
new_df=new_df.assign(rank=new_df.groupby('name')['rank'].cumsum()).sort_index()
print(new_df)
name game_id rank
0 pam 0 1
1 pam 0 1
2 bob 1 1
3 bob 1 1
4 pam 0 1
5 bob 2 2
6 pam 1 2
7 bob 2 2
Upvotes: 2