Reputation: 355
I have a huge pandas data frame where each row corresponds to a single sports match. It looks like the following:
**EDIT: I'll change the example code to better reflect the actual data: This made me realize the presence of values other than 'lost' or 'won' makes this a lot more difficult.
d = {'date': ['21.01.96', '22.02.96', '23.02.96', '24.02.96', '25.02.96',
'26.02.96', '27.02.96', '28.02.96', '29.02.96', '30.02.96'],
'challenger': [5, 5, 10, 5, 4, 5, 8, 8, 10, 8],
'opponent': [2, 4, 5, 4, 5, 10, 5, 2, 4, 10],
'outcome': ['win', 'lost', 'declined', 'win', 'declined', 'win', 'declined', 'declined', 'lost', 'lost']
}
df = pd.DataFrame(data=d)
For each matchup I want to calculate previous wins/losses in a new variable. In the example case, the 'prev_wins' variable would be [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]. I did manage to create working code for this, which looks like this:
data['prev_wins_spec_challenger'] = 0
data['prev_losses_spec_challenger'] = 0
data['challenger'] = data['challenger'].astype(str)
data['opponent'] = data['opponent'].astype(str)
data['matchups'] = data['challenger'] + '-' + data['opponent']
# create list of matchups with unique pairings
matchups_temp = list(data['matchups'].unique())
matchups = []
for match in matchups_temp:
if match[::-1] in matchups:
pass
else:
matchups.append(match)
prev_wins = {}
for i in matchups:
prev_wins[i] = 0
prev_losses = {}
for i in matchups:
prev_losses[i] = 0
# go through data set for each matchup and calculate variables
for i in range(0, len(matchups)):
match = matchups[i].split('-')
challenger = match[0]
opponent = match[1]
for index, row in data.iterrows():
if row['challenger'] == challenger and row['opponent'] == opponent:
if row['outcome'] == 'won':
data['prev_wins_spec_challenger'][index] = prev_wins[matchups[i]]
prev_wins[matchups[i]] += 1
elif row['outcome'] == 'lost':
data['prev_losses_spec_challenger'][index] = prev_losses[matchups[i]]
prev_losses[matchups[i]] += 1
elif row['challenger'] == opponent and row['opponent'] == challenger:
if row['outcome'] == 'won':
data['prev_losses_spec_challenger'][index] = prev_losses[matchups[i]]
prev_losses[matchups[i]] += 1
elif row['outcome'] == 'lost':
data['prev_wins_spec_challenger'][index] = prev_wins[matchups[i]]
prev_wins[matchups[i]] += 1
The problem with this is that it takes incredibly long cause there are a total of ~65.000 different matchups and the data frame has ~170.000 rows. On my laptop this would take around 180 hours to run, which is not acceptable.
I am sure there is a better solution for this but even after searching the internet the whole day I was not able to find one. How can I make this code faster?
Upvotes: 3
Views: 117
Reputation: 59264
df['outcome'] = df.outcome.map({'win':1, 'loss':0})
Then
df.groupby('challenger').outcome.cumsum().sub(1).clip(lower=0)
Of course, you don't need to overwrite the values in outcome
(you can create a new column and work with it). But usually in pandas operations are way faster when working with int
s than when working with string
s. So from a performance point-of-view, it is preferable to have 0
and 1
representing wins and losses than having the actual words loss
and win
.
In the last layer, just when you are presenting the information, that's when you map back to human-understandable words. But the processing don't usually need strings
Upvotes: 2
Reputation: 51335
IIUC, you can do something like this, using shift()
to look at the previous outcomes, and getting the cumulative sum of the boolean of where it is equal to win
:
data['previous_wins'] = data.groupby('challenger').outcome.transform(lambda x: x.shift().eq('win').cumsum())
>>> data
challenger date opponent outcome previous_wins
0 5 21.01.96 6 win 0
1 4 22.02.96 3 loss 0
2 5 23.02.96 6 win 1
If you're looking to count how many wins a challenger had against a specific opponent, you can just groupby both the challenger and opponent:
data['previous_wins'] = data.groupby(['opponent','challenger']).outcome.transform(lambda x: x.shift().eq('win').cumsum())
Upvotes: 0