Reputation: 111
I need get mean of rows, but with date before the current row date.
I have this code, but is it takes a long time in datasets with 50k rows:
import pandas as pd
data = {
'id': [1,2,3,4,5],
'home_goals': [1,0,3,1,2],
'away_goals': [1,1,2,0,1],
'home_name': ['a','b','a','b','a'],
'away_name': ['b','a','b','a','b'],
'date': ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05'],
}
df = pd.DataFrame(data=data)
for i, row in df.iterrows():
rows_before_date = df[df['date'] < row['date']]
home_in_home_matches = rows_before_date[rows_before_date['home_name'] == row['home_name']]
home_in_away_matches = rows_before_date[rows_before_date['home_name'] == row['away_name']]
away_in_away_matches = rows_before_date[rows_before_date['away_name'] == row['away_name']]
away_in_home_matches = rows_before_date[rows_before_date['away_name'] == row['home_name']]
if any(v == 0 for v in [len(home_in_home_matches), len(home_in_away_matches), len(away_in_away_matches), len(away_in_home_matches)]): continue
df.loc[i, 'home_scored_at_home_mean'] = home_in_home_matches['home_goals'].sum() / len(home_in_home_matches)
df.loc[i, 'away_scored_at_away_mean'] = away_in_away_matches['away_goals'].sum() / len(away_in_away_matches)
df.loc[i, 'home_scored_at_away_mean'] = home_in_away_matches['away_goals'].sum() / len(home_in_away_matches)
df.loc[i, 'away_scored_at_home_mean'] = away_in_home_matches['home_goals'].sum() / len(away_in_home_matches)
I wanted to know if it is possible to make a more optimized and readable code.
Upvotes: 1
Views: 57
Reputation: 260600
Assuming prior sorting of the dates, you can use a shifted expanding.mean
per group:
df['home_mean'] = (df.groupby('home_name')['home_goals']
.apply(lambda s: s.expanding().mean().shift())
)
df['away_mean'] = (df.groupby('away_name')['away_goals']
.apply(lambda s: s.expanding().mean().shift())
)
output:
id home_goals away_goals home_name away_name date home_mean away_mean
0 1 1 1 a b 2020-01-01 NaN NaN
1 2 0 1 b a 2020-01-02 NaN NaN
2 3 3 2 a b 2020-01-03 1.0 1.0
3 4 1 0 b a 2020-01-04 0.0 1.0
4 5 2 1 a b 2020-01-05 2.0 1.5
Upvotes: 3