Reputation: 529
I have this Dataframe
Unnamed: 0 Datetime HomeTeam AwayTeam Ball PossessionMatch_H Ball PossessionMatch_A
0 0 2021-05-24 02:30:00 U. De Chile Everton 68 32
1 1 2021-05-23 21:00:00 Huachipato Colo Colo 48 52
2 2 2021-05-23 18:30:00 Melipilla Antofagasta 47 53
3 3 2021-05-23 02:30:00 U. Espanola U. Catolica 37 63
4 4 2021-05-23 00:00:00 S. Wanderers O'Higgins 29 71
... ... ... ... ... ... ...
57 57 2021-03-28 15:45:00 Palestino Antofagasta 58 42
58 58 2021-03-28 01:00:00 U. Espanola S. Wanderers 50 50
59 59 2021-03-27 22:30:00 Colo Colo Union La Calera 58 42
60 60 2021-03-27 20:00:00 Everton O'Higgins 54 46
61 61 2021-03-27 15:00:00 Curico Unido Melipilla 41 59
I want to split it in multiple dataframes and apply two criteria in "HomeTeam" and "AwayTeam", then calculate the average of Ball Possession and put it in a new column "Ball PossessionMatch_H/MP" if the team is in "HomeTeam" and "Ball PossessionMatch_A/MP" if the team is in "AwayTeam"
Code:
teams = pd.unique(df[['HomeTeam', 'AwayTeam']].values.ravel('K'))
df_list3 = []
for team in teams:
if team in df['HomeTeam']:
current_df = df[(df.HomeTeam == team) | (df.AwayTeam == team)].copy()
current_df["Ball PossessionMatch_H/MP"] = current_df.apply(
lambda x: (
current_df[(current_df.HomeTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_H"].sum()
+ current_df[(current_df.AwayTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_A"].sum()
)
/ (
current_df[(current_df.HomeTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_H"].size
+ current_df[(current_df.AwayTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_A"].size
),
axis=1,
)
else:
current_df = df[(df.HomeTeam == team) | (df.AwayTeam == team)].copy()
current_df["Ball PossessionMatch_A/MP"] = current_df.apply(
lambda x: (
current_df[(current_df.HomeTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_H"].sum()
+ current_df[(current_df.AwayTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_A"].sum()
)
/ (
current_df[(current_df.HomeTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_H"].size
+ current_df[(current_df.AwayTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_A"].size
),
axis=1,
)
df_list3.append(current_df)
df = pd.concat(df_list3)
print(df)
It does not consider the if
and create only "Ball PossessionMatch_A/MP" column with calculated values and duplicates the rows, I would append
the results to the original dataframe.
Expected Result:
Unnamed: 0 Datetime HomeTeam AwayTeam Ball PossessionMatch_H Ball PossessionMatch_A Ball PossessionMatch_H/MP Ball PossessionMatch_A/MP
0 0 2021-05-24 02:30:00 U. De Chile Everton 68 32 50.33 43.5
Upvotes: 0
Views: 61
Reputation: 71
How about the following?
hometeam_count = df.groupby("HomeTeam")["Ball PossessionMatch_H"].count()
hometeam_sum = df.groupby("HomeTeam")["Ball PossessionMatch_H"].sum()
awayteam_count = df.groupby("AwayTeam")["Ball PossessionMatch_A"].count()
awayteam_sum = df.groupby("AwayTeam")["Ball PossessionMatch_A"].sum()
df["Ball PossessionMatch_H/MP"] = df["HomeTeam"].apply(lambda x: ((hometeam_sum.loc[x] if x in hometeam_sum.index else 0) + (awayteam_sum.loc[x] if x in awayteam_sum.index else 0)) / ((hometeam_count.loc[x] if x in hometeam_count.index else 0) + (awayteam_count.loc[x] if x in awayteam_count.index else 0)))
df["Ball PossessionMatch_A/MP"] = df["AwayTeam"].apply(lambda x: ((hometeam_sum.loc[x] if x in hometeam_sum.index else 0) + (awayteam_sum.loc[x] if x in awayteam_sum.index else 0)) / ((hometeam_count.loc[x] if x in hometeam_count.index else 0) + (awayteam_count.loc[x] if x in awayteam_count.index else 0)))
The code above utilises the groupby
function to calculate the sum
and the count
for each team and for both Ball PossessionMatch_A
and Ball PossessionMatch_H
separately.
The second part of the code basically looking up (loc
) the corresponding sum
and count
values for each HomeTeam
and AwayTeam
respectively, and perform the average calculation as intended.
As you can tell, the code for both HomeTeam
and AwayTeam
are pretty much identical which also be cleaned up so that we Dont' repeat yourself. I presented as such purely for readability purpose.
Although this does not solve the problem (as to why the values are not filled in the columns that you have created), I am providing an alternative solution to the question, which I think is more intuitive and more readable.
Upvotes: 1
Reputation: 1432
Disclaimer: Just helping you to progress in some direction, not providing a copy/paste solution.
I edited a bit your dataframe for demonstration purposes:
import pandas as pd
df = pd.DataFrame(
{
"Datetime": [pd.Timestamp(x) for x in ["2021-05-24 02:30:00", "2021-05-23 21:00:00",
"2021-05-23 18:30:00", "2021-05-23 02:30:00",
"2021-05-23 00:00:00"]],
"HomeTeam": ["Huachipato", "Huachipato", "Huachipato", "U. Espanola", "U. Espanola"],
"AwayTeam": ["Everton", "Colo Colo", "Antofagasta", "Everton", "Colo Colo"],
"Ball PossessionMatch_A": [68, 48, 47, 37, 29],
}
)
print(df.groupby("HomeTeam")["Ball PossessionMatch_A"].mean())
The output is:
HomeTeam
Huachipato 54.333333
U. Espanola 33.000000
Name: Ball PossessionMatch_A, dtype: float64
Now if you need to filter on some other information, say for a given "AwayTeam" such as "Everton", you could:
print(df[df["AwayTeam"].eq("Everton")].groupby("HomeTeam")["Ball PossessionMatch_A"].mean())
and then you would get:
HomeTeam
Huachipato 68
U. Espanola 37
Name: Ball PossessionMatch_A, dtype: int64
Upvotes: 0