Reputation: 37
I have a very large table of data containing `play_id`, `position` and `frame` columns.
- `play_id` - refers to a particular play
- `position` - either A or B (there are 3 players (rows) in total for each play, in any combination of As and Bs)
- `frame` - time frame (e.g. imagine a freeze-frame every 1 second)
The following shows example data (simplified for clarity):
| play_id | position | frame |
|---------|----------|-------|
| 1       | A        | 1     |
| 1       | A        | 1     |
| 1       | B        | 1     |
| 1       | A        | 2     |
| 1       | A        | 2     |
| 1       | B        | 2     |
| 2       | A        | 1     |
| 2       | B        | 1     |
| 2       | B        | 1     |
| 2       | A        | 2     |
| 2       | B        | 2     |
| 2       | B        | 2     |
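For reference, a minimal pandas sketch that rebuilds this example table (the DataFrame name `df` is assumed; the values are copied from the table above):

import pandas as pd

# example data from the question
df = pd.DataFrame({
    'play_id':  [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'position': ['A', 'A', 'B', 'A', 'A', 'B', 'A', 'B', 'B', 'A', 'B', 'B'],
    'frame':    [1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2],
})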
I want to count the number of players in each position for each `play_id` (the players are consistent across frames within a given `play_id`) and append a running number to each position so that every player's label is unique.
This would result in the following:
| play_id | position | frame |
|---------|----------|-------|
| 1       | A_1      | 1     |
| 1       | A_2      | 1     |
| 1       | B_1      | 1     |
| 1       | A_1      | 2     |
| 1       | A_2      | 2     |
| 1       | B_1      | 2     |
| 2       | A_1      | 1     |
| 2       | B_1      | 1     |
| 2       | B_2      | 1     |
| 2       | A_1      | 2     |
| 2       | B_1      | 2     |
| 2       | B_2      | 2     |
In reality I have 7 different positions, 25 frames and around 500,000 play_ids (along with many more columns).
How can I achieve this in an efficient manner? I assume groupby
should be used, but I can't figure out how.
Upvotes: 0
Views: 32
Reputation: 88305
You could use `groupby.cumcount` to group by all three columns and add a count for each position:
# number the players within each (play_id, frame, position) group,
# then append that number to the position label
df['position'] = df.position.str.cat(
    df.groupby(['play_id', 'frame', 'position'])
      .position.cumcount().add(1).astype(str),
    sep='_')

print(df)
play_id position frame
0 1 A_1 1
1 1 A_2 1
2 1 B_1 1
3 1 A_1 2
4 1 A_2 2
5 1 B_1 2
6 2 A_1 1
7 2 B_1 1
8 2 B_2 1
9 2 A_1 2
10 2 B_1 2
11 2 B_2 2
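An equivalent formulation (a sketch, not part of the original answer) computes the suffix first with `cumcount` and concatenates with plain `+` instead of `str.cat`:

# number the rows 0, 1, ... within each (play_id, frame, position) group,
# shift to 1-based and append to the position label
suffix = df.groupby(['play_id', 'frame', 'position']).cumcount().add(1).astype(str)
df['position'] = df['position'] + '_' + suffix

Both versions group by the same three columns, so they produce the same labels.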
Upvotes: 1