Khaled Koubaa
Khaled Koubaa

Reputation: 527

faster pandas solution when working with 60,000 row dataframe, groupby and apply

I have dataset for matches and home team result at the current match

match_date  home    away    home_result   
2021-11-22  team1   team2   Win
2021-11-22  team3   team4   Win 
2021-11-23  team1   team8   Lose
2021-11-23  team6   team7   Win
2021-11-25  team1   team2   Win 
2021-11-25  team3   team8   Lose 
2021-11-25  team1   team5   Lose 
2021-11-25  team6   team5   Win 
2021-11-28  team3   team1   Lose 
2021-11-29  team1   team5   Win 
2021-11-29  team6   team9   Win 

I have a code to create new column where I can put previous result for each home team before the current match:

df['home_team_previous_results'] = (
    df.groupby('home')
    .apply(
        lambda x: pd.Series(
            [
                [
                    tuple([row[col] for col in ['home_result', 'match_date']])
                    for _, row in x.iloc[0:i].iterrows()
                ] or np.nan
                for i in range(len(x))
            ],
        index=x.index)
    ).droplevel(0)
)

this is the output:

match_date  home    away    home_result   home_team_previous_results
2021-11-22  team1   team2   Win           NaN
2021-11-22  team3   team4   Win           NaN
2021-11-23  team1   team8   Lose          [("Win","2021-11-22")]  
2021-11-23  team6   team7   Win           NaN 
2021-11-25  team1   team2   Win           [("Win","2021-11-22"), ("Lose","2021-11-23")]
2021-11-25  team3   team8   Lose          [("Win","2021-11-22")]
2021-11-25  team1   team5   Lose          [("Win","2021-11-22"), ("Lose","2021-11-23"), ("Win","2021-11-25")]
2021-11-25  team6   team5   Win           [("Win","2021-11-23")]
2021-11-28  team3   team1   Lose          [("Win","2021-11-22"), ("Lose","2021-11-25")]
2021-11-29  team1   team5   Win           [("Win","2021-11-22"), ("Lose","2021-11-23"), ("Win","2021-11-25"), ("Lose","2021-11-25")]
2021-11-29  team6   team9   Win           [("Win","2021-11-23"), ("Win","2021-11-25")]

The issue is that the code takes a long time with a large dataset (more than 60,000 rows) and I need it to run very, very fast. Any idea how to make it faster or better version of it ?

Upvotes: 0

Views: 142

Answers (1)

BENY
BENY

Reputation: 323226

You can do egg tuple then cumsum

df['new'] = df[['match_date','home_result']].agg(tuple,1).groupby(df['home']).apply(lambda x : x.cumsum().shift())
0                                                   NaN
1                                                   NaN
2                                     (2021-11-22, Win)
3                                                   NaN
4                   (2021-11-22, Win, 2021-11-23, Lose)
5                                     (2021-11-22, Win)
6     (2021-11-22, Win, 2021-11-23, Lose, 2021-11-25...
7                                     (2021-11-23, Win)
8                   (2021-11-22, Win, 2021-11-25, Lose)
9     (2021-11-22, Win, 2021-11-23, Lose, 2021-11-25...
10                   (2021-11-23, Win, 2021-11-25, Win)
dtype: object

Upvotes: 3

Related Questions