Is there a way to go through rows in a pandas data frame more efficiently?

Question

I have a huge pandas data frame where each row corresponds to a single sports match. It looks like the following:

**EDIT: I'll change the example code to better reflect the actual data: This made me realize the presence of values other than 'lost' or 'won' makes this a lot more difficult.

d = {'date': ['21.01.96', '22.02.96', '23.02.96', '24.02.96', '25.02.96',
          '26.02.96', '27.02.96', '28.02.96', '29.02.96', '30.02.96'], 
     'challenger': [5, 5, 10, 5, 4, 5, 8, 8, 10, 8],
     'opponent': [2, 4, 5, 4, 5, 10, 5, 2, 4, 10],
     'outcome': ['win', 'lost', 'declined', 'win', 'declined', 'win', 'declined', 'declined', 'lost', 'lost']
     }
df = pd.DataFrame(data=d)

For each matchup I want to calculate previous wins/losses in a new variable. In the example case, the 'prev_wins' variable would be [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]. I did manage to create working code for this, which looks like this:

data['prev_wins_spec_challenger'] = 0
data['prev_losses_spec_challenger'] = 0               

data['challenger'] = data['challenger'].astype(str)
data['opponent'] = data['opponent'].astype(str)

data['matchups'] = data['challenger'] + '-' + data['opponent']

# create list of matchups with unique pairings
matchups_temp = list(data['matchups'].unique())
matchups = []
for match in matchups_temp:
    if match[::-1] in matchups:
        pass
    else:
        matchups.append(match)

prev_wins = {}
for i in matchups:
    prev_wins[i] = 0

prev_losses = {}
for i in matchups:
    prev_losses[i] = 0

# go through data set for each matchup and calculate variables
for i in range(0, len(matchups)):
    match = matchups[i].split('-')
    challenger = match[0]
    opponent = match[1]
    for index, row in data.iterrows():
        if row['challenger'] == challenger and row['opponent'] == opponent:
            if row['outcome'] == 'won':
                data['prev_wins_spec_challenger'][index] = prev_wins[matchups[i]]
                prev_wins[matchups[i]] += 1
            elif row['outcome'] == 'lost':
                data['prev_losses_spec_challenger'][index] = prev_losses[matchups[i]]
                prev_losses[matchups[i]] += 1
        elif row['challenger'] == opponent and row['opponent'] == challenger:
            if row['outcome'] == 'won':
                data['prev_losses_spec_challenger'][index] = prev_losses[matchups[i]]
                prev_losses[matchups[i]] += 1
            elif row['outcome'] == 'lost':
                data['prev_wins_spec_challenger'][index] = prev_wins[matchups[i]]
                prev_wins[matchups[i]] += 1

The problem with this is that it takes incredibly long cause there are a total of ~65.000 different matchups and the data frame has ~170.000 rows. On my laptop this would take around 180 hours to run, which is not acceptable.

I am sure there is a better solution for this but even after searching the internet the whole day I was not able to find one. How can I make this code faster?

rafaelc · Accepted Answer

IIUC, groupby and cumsum

df['outcome'] = df.outcome.map({'win':1, 'loss':0})

Then

df.groupby('challenger').outcome.cumsum().sub(1).clip(lower=0)

Of course, you don't need to overwrite the values in outcome (you can create a new column and work with it). But usually in pandas operations are way faster when working with ints than when working with strings. So from a performance point-of-view, it is preferable to have 0 and 1 representing wins and losses than having the actual words loss and win.

In the last layer, just when you are presenting the information, that's when you map back to human-understandable words. But the processing don't usually need strings

Is there a way to go through rows in a pandas data frame more efficiently?

Answers (2)

Related Questions