Reputation: 27
I am working with the below dataset and I am running into trouble calculating total score based on team id. A team can be either home or away and I am looking to calculate a running total of their score.
I have successfully created a running total based on home_id and away_id as well as a running average based on home/away id, but I am struggling to calculate based on both columns
For example if in game 1 home team scores 1 and then in game 2 they are the away team and score 3 I would like to create a column that says they have scored 4 in total up to this point in the dataset
My code so far is:
import pandas as pd
game_data = pd.read_csv('game_data.csv')
game_data['home_avg_home_games'] = game_data.groupby('home_id')['home_score'].transform(lambda x: x.rolling(165, min_periods = 0).mean())
game_data['home_avg_against_home_games'] = game_data.groupby('home_id')['away_score'].transform(lambda x: x.rolling(165, min_periods = 0).mean())
game_data['away_avg_away_games'] = game_data.groupby('away_id')['away_score'].transform(lambda x: x.rolling(165, min_periods = 0).mean())
game_data['away_avg_against_away_games'] = game_data.groupby('away_id')['home_score'].transform(lambda x: x.rolling(165, min_periods = 0).mean())
game_data['scored_home_total'] = game_data.groupby('home_id')['home_score'].cumsum()
game_data['scored_away_total'] = game_data.groupby('away_id')['away_score'].cumsum()
game_id | away_id | home_id | away_score | home_score | home_avg_home_games | home_avg_against_home_games | away_avg_away_games | away_avg_against_away_games |
---|---|---|---|---|---|---|---|---|
446877 | 138 | 134 | 1 | 4 | 4 | 1 | 1 | 4 |
446911 | 141 | 139 | 5 | 3 | 3 | 5 | 5 | 3 |
446873 | 121 | 118 | 3 | 4 | 4 | 3 | 3 | 4 |
446875 | 137 | 158 | 12 | 3 | 3 | 12 | 12 | 3 |
446872 | 142 | 110 | 2 | 3 | 3 | 2 | 2 | 3 |
446876 | 136 | 140 | 2 | 3 | 3 | 2 | 2 | 3 |
446874 | 143 | 113 | 2 | 6 | 6 | 2 | 2 | 6 |
446879 | 120 | 144 | 4 | 3 | 3 | 4 | 4 | 3 |
446871 | 119 | 135 | 15 | 0 | 0 | 15 | 15 | 0 |
446878 | 141 | 139 | 5 | 3 | 3 | 5 | 5 | 3 |
446869 | 115 | 109 | 10 | 5 | 5 | 10 | 10 | 5 |
446889 | 112 | 108 | 9 | 0 | 0 | 9 | 9 | 0 |
446868 | 145 | 133 | 4 | 3 | 3 | 4 | 4 | 3 |
446870 | 117 | 147 | 5 | 3 | 3 | 5 | 5 | 3 |
446867 | 111 | 114 | 6 | 2 | 2 | 6 | 6 | 2 |
446896 | 121 | 118 | 2 | 0 | 2 | 2.5 | 2.5 | 2 |
446910 | 138 | 134 | 5 | 6 | 5 | 3 | 3 | 5 |
446887 | 141 | 139 | 2 | 3 | 3 | 4 | 4 | 3 |
446883 | 116 | 146 | 8 | 7 | 7 | 8 | 8 | 7 |
446886 | 136 | 140 | 10 | 2 | 2.5 | 6 | 6 | 2.5 |
446885 | 137 | 158 | 2 | 1 | 2 | 7 | 7 | 2 |
446882 | 115 | 109 | 6 | 11 | 8 | 8 | 8 | 8 |
446880 | 112 | 108 | 6 | 1 | 0.5 | 7.5 | 7.5 | 0.5 |
446881 | 145 | 133 | 5 | 4 | 3.5 | 4.5 | 4.5 | 3.5 |
446884 | 119 | 135 | 3 | 0 | 0 | 9 | 9 | 0 |
446901 | 141 | 139 | 3 | 5 | 3.5 | 3.75 | 3.75 | 3.5 |
446898 | 137 | 158 | 3 | 4 | 2.666666667 | 5.666666667 | 5.666666667 | 2.666666667 |
446899 | 136 | 140 | 9 | 5 | 3.333333333 | 7 | 7 | 3.333333333 |
446891 | 115 | 109 | 4 | 3 | 6.333333333 | 6.666666667 | 6.666666667 | 6.333333333 |
My desired output is:
game_id | away_id | home_id | away_score | home_score | home_avg_for_home_games | home_avg_against_home_games | away_avg_away_games | away_avg_against_away_games | home_total_score | away_total_score |
---|---|---|---|---|---|---|---|---|---|---|
446877 | 1 | 2 | 1 | 4 | 4 | 1 | 1 | 4 | 4 | 1 |
446911 | 2 | 3 | 5 | 3 | 3 | 5 | 5 | 3 | 3 | 5 |
446873 | 1 | 3 | 3 | 4 | 3.5 | 4 | 2 | 4 | 7 | 4 |
Upvotes: 1
Views: 122
Reputation: 1129
This will help you to solve your problem. However it gives warning for avoid warning u need to use include_groups=False
however if u that it remove home_id
and away_id
game_data = game_data.groupby('home_id',group_keys=False,as_index=False).apply(lambda x: x.assign(home_total_score=x['home_score'].cumsum()),include_groups=True)
game_data= game_data.groupby('away_id',group_keys=False,as_index=False).apply(lambda x: x.assign(away_total_score=x['away_score'].cumsum()),include_groups=True)
I have solved that issue with like that I created new away_id with using np.where()
. So you need to add import numpy as np
in your code.
new = game_data.groupby('home_id',group_keys=False,as_index=False).apply(lambda x: x.assign(home_total_score=x['home_score'].cumsum()),include_groups=False)
new= game_data.groupby('away_id',group_keys=False,as_index=False).apply(lambda x: x.assign(away_total_score=x['away_score'].cumsum()),include_groups=False)
new["away_id"] = np.where(new["game_id"] == game_data["game_id"],game_data["away_id"],None)
For reindex columns add these line codes. Becasue we created new column and it added into last index.
columns = new.columns.tolist()
columns.insert(columns.index('away_score'), columns.pop(columns.index('away_id')))
new = new.reindex(columns=columns)
Upvotes: 1