luka
luka

Reputation: 529

Calculating average dataframe filtered by two columns

I have this Dataframe

    Unnamed: 0  Datetime    HomeTeam    AwayTeam    Ball PossessionMatch_H  Ball PossessionMatch_A
0   0   2021-05-24 02:30:00 U. De Chile Everton                         68      32
1   1   2021-05-23 21:00:00 Huachipato  Colo Colo                       48      52
2   2   2021-05-23 18:30:00 Melipilla   Antofagasta                     47      53
3   3   2021-05-23 02:30:00 U. Espanola U. Catolica                     37      63
4   4   2021-05-23 00:00:00 S. Wanderers    O'Higgins                   29      71
... ... ... ... ... ... ...
57  57  2021-03-28 15:45:00 Palestino   Antofagasta                     58      42
58  58  2021-03-28 01:00:00 U. Espanola S. Wanderers                    50      50
59  59  2021-03-27 22:30:00 Colo Colo   Union La Calera                 58      42
60  60  2021-03-27 20:00:00 Everton O'Higgins                           54      46
61  61  2021-03-27 15:00:00 Curico Unido    Melipilla                   41      59

I want to split it in multiple dataframes and apply two criteria in "HomeTeam" and "AwayTeam", then calculate the average of Ball Possession and put it in a new column "Ball PossessionMatch_H/MP" if the team is in "HomeTeam" and "Ball PossessionMatch_A/MP" if the team is in "AwayTeam"

Code:

teams = pd.unique(df[['HomeTeam', 'AwayTeam']].values.ravel('K'))
df_list3 = []
for team in teams:
    if team in df['HomeTeam']:
        current_df = df[(df.HomeTeam == team) | (df.AwayTeam == team)].copy()
        current_df["Ball PossessionMatch_H/MP"] = current_df.apply(
            lambda x: (
                current_df[(current_df.HomeTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_H"].sum()
                + current_df[(current_df.AwayTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_A"].sum()
            )
            / (
                current_df[(current_df.HomeTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_H"].size
                + current_df[(current_df.AwayTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_A"].size
            ),
            axis=1,
        )
    else:
        current_df = df[(df.HomeTeam == team) | (df.AwayTeam == team)].copy()
        current_df["Ball PossessionMatch_A/MP"] = current_df.apply(
            lambda x: (
                current_df[(current_df.HomeTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_H"].sum()
                + current_df[(current_df.AwayTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_A"].sum()
            )
            / (
                current_df[(current_df.HomeTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_H"].size
                + current_df[(current_df.AwayTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_A"].size
            ),
            axis=1,
        )
        df_list3.append(current_df)

df = pd.concat(df_list3)
print(df)

It does not consider the if and create only "Ball PossessionMatch_A/MP" column with calculated values and duplicates the rows, I would append the results to the original dataframe.

Expected Result:

    Unnamed: 0  Datetime    HomeTeam    AwayTeam    Ball PossessionMatch_H  Ball PossessionMatch_A  Ball PossessionMatch_H/MP   Ball PossessionMatch_A/MP
0   0   2021-05-24 02:30:00 U. De Chile Everton               68           32         50.33      43.5

Upvotes: 0

Views: 61

Answers (2)

EyeOfAgamotto
EyeOfAgamotto

Reputation: 71

How about the following?

hometeam_count = df.groupby("HomeTeam")["Ball PossessionMatch_H"].count()
hometeam_sum = df.groupby("HomeTeam")["Ball PossessionMatch_H"].sum()
awayteam_count = df.groupby("AwayTeam")["Ball PossessionMatch_A"].count()
awayteam_sum = df.groupby("AwayTeam")["Ball PossessionMatch_A"].sum()

df["Ball PossessionMatch_H/MP"] = df["HomeTeam"].apply(lambda x: ((hometeam_sum.loc[x] if x in hometeam_sum.index else 0) + (awayteam_sum.loc[x] if x in awayteam_sum.index else 0)) / ((hometeam_count.loc[x] if x in hometeam_count.index else 0)  + (awayteam_count.loc[x] if x in awayteam_count.index else 0)))
df["Ball PossessionMatch_A/MP"] = df["AwayTeam"].apply(lambda x: ((hometeam_sum.loc[x] if x in hometeam_sum.index else 0) + (awayteam_sum.loc[x] if x in awayteam_sum.index else 0)) / ((hometeam_count.loc[x] if x in hometeam_count.index else 0)  + (awayteam_count.loc[x] if x in awayteam_count.index else 0)))

The code above utilises the groupby function to calculate the sum and the count for each team and for both Ball PossessionMatch_A and Ball PossessionMatch_H separately.
The second part of the code basically looking up (loc) the corresponding sum and count values for each HomeTeam and AwayTeam respectively, and perform the average calculation as intended.

As you can tell, the code for both HomeTeam and AwayTeam are pretty much identical which also be cleaned up so that we Dont' repeat yourself. I presented as such purely for readability purpose.

Although this does not solve the problem (as to why the values are not filled in the columns that you have created), I am providing an alternative solution to the question, which I think is more intuitive and more readable.

Upvotes: 1

deponovo
deponovo

Reputation: 1432

Disclaimer: Just helping you to progress in some direction, not providing a copy/paste solution.

I edited a bit your dataframe for demonstration purposes:

import pandas as pd

df = pd.DataFrame(
    {
     "Datetime": [pd.Timestamp(x) for x in ["2021-05-24 02:30:00", "2021-05-23 21:00:00", 
                                            "2021-05-23 18:30:00", "2021-05-23 02:30:00", 
                                            "2021-05-23 00:00:00"]],
     "HomeTeam": ["Huachipato", "Huachipato", "Huachipato", "U. Espanola", "U. Espanola"],
     "AwayTeam": ["Everton", "Colo Colo", "Antofagasta", "Everton",  "Colo Colo"],
     "Ball PossessionMatch_A": [68, 48, 47, 37, 29],
    }     
)


print(df.groupby("HomeTeam")["Ball PossessionMatch_A"].mean())

The output is:

HomeTeam
Huachipato     54.333333
U. Espanola    33.000000
Name: Ball PossessionMatch_A, dtype: float64

Now if you need to filter on some other information, say for a given "AwayTeam" such as "Everton", you could:

print(df[df["AwayTeam"].eq("Everton")].groupby("HomeTeam")["Ball PossessionMatch_A"].mean())

and then you would get:

HomeTeam
Huachipato     68
U. Espanola    37
Name: Ball PossessionMatch_A, dtype: int64

Upvotes: 0

Related Questions