faster pandas solution when working with 60,000 row dataframe, groupby and apply

Question

I have dataset for matches and home team result at the current match

match_date  home    away    home_result   
2021-11-22  team1   team2   Win
2021-11-22  team3   team4   Win 
2021-11-23  team1   team8   Lose
2021-11-23  team6   team7   Win
2021-11-25  team1   team2   Win 
2021-11-25  team3   team8   Lose 
2021-11-25  team1   team5   Lose 
2021-11-25  team6   team5   Win 
2021-11-28  team3   team1   Lose 
2021-11-29  team1   team5   Win 
2021-11-29  team6   team9   Win

I have a code to create new column where I can put previous result for each home team before the current match:

df['home_team_previous_results'] = (
    df.groupby('home')
    .apply(
        lambda x: pd.Series(
            [
                [
                    tuple([row[col] for col in ['home_result', 'match_date']])
                    for _, row in x.iloc[0:i].iterrows()
                ] or np.nan
                for i in range(len(x))
            ],
        index=x.index)
    ).droplevel(0)
)

this is the output:

match_date  home    away    home_result   home_team_previous_results
2021-11-22  team1   team2   Win           NaN
2021-11-22  team3   team4   Win           NaN
2021-11-23  team1   team8   Lose          [("Win","2021-11-22")]  
2021-11-23  team6   team7   Win           NaN 
2021-11-25  team1   team2   Win           [("Win","2021-11-22"), ("Lose","2021-11-23")]
2021-11-25  team3   team8   Lose          [("Win","2021-11-22")]
2021-11-25  team1   team5   Lose          [("Win","2021-11-22"), ("Lose","2021-11-23"), ("Win","2021-11-25")]
2021-11-25  team6   team5   Win           [("Win","2021-11-23")]
2021-11-28  team3   team1   Lose          [("Win","2021-11-22"), ("Lose","2021-11-25")]
2021-11-29  team1   team5   Win           [("Win","2021-11-22"), ("Lose","2021-11-23"), ("Win","2021-11-25"), ("Lose","2021-11-25")]
2021-11-29  team6   team9   Win           [("Win","2021-11-23"), ("Win","2021-11-25")]

The issue is that the code takes a long time with a large dataset (more than 60,000 rows) and I need it to run very, very fast. Any idea how to make it faster or better version of it ?

faster pandas solution when working with 60,000 row dataframe, groupby and apply

Answers (1)

Related Questions