Irisoft Education
Irisoft Education

Reputation: 3

How do I use Pandas to get the sum of 2 columns based on the value of another column in a dataframe

I have a dataset with a lot of columns and I already have filtered the ones that I need:

import pandas as pd

data = pd.read_csv("./data/GL1871.txt", header=None, usecols=[3,6,9,10])
new_data = data.rename(columns={3: 'Away', 6: 'Home', 9: 'Away runs', 10: 'Home runs'})

What I want to get out of this dataframe is the sum of Column 'Away runs' and 'Home runs' for each team. The output should look something like this:

0  CL1 364
1  BS1 254
...
9  CH1 190

So far I have tried the groupby() method and the output is not what I really need:

runs_away = new_data.groupby( 'Home')['Away runs'].sum()
runs_away = new_data.groupby( 'Home')['Home runs'].sum()

Home
BS1    165
CH1    139
CL1    162
FW1    112
NY2    127
PH1    124
RC1     66
TRO    231
WS3    120
Name: Away runs, dtype: int64
Home
BS1    223
CH1    197
CL1    119
FW1     78
NY2    178
PH1    180
RC1     72
TRO    200
WS3    166
Name: Home runs, dtype: int64

Is there any smart way to do this and getting both values at the same time? Maybe a comprehension would be better, but I don't know how to iterate a Dataframe. Thank you in advance.

Also, the expected result is to get the sum of the runs in Away runs and Home runs for the same team. So sum Away runs where the Away is team x + Home runs where Home is team x as well

Adding sample from the dataset:

    Away Home  Away runs  Home runs
0    CL1  FW1          0          2
1    BS1  WS3         20         18
2    CL1  RC1         12          4
3    CL1  CH1         12         14
4    BS1  TRO          9          5

Desired output:

0  CL1 364 
1  BS1 254
...
9  CH1 190

#Where 364 is the sum of all runs of the team CL1 wether it was away or home

       Away Home  Away runs  Home runs
0   CL1  FW1          0          2
1   BS1  WS3         20         18
2   CL1  RC1         12          4
3   CL1  CH1         12         14
4   BS1  TRO          9          5
5   CH1  CL1         18         10
6   WS3  CL1         12          8
7   CH1  FW1         14          5
8   WS3  FW1          6         12
9   TRO  BS1         29         14
10  WS3  CH1          4         14

Upvotes: 0

Views: 117

Answers (2)

Cameron Riddell
Cameron Riddell

Reputation: 13407

Pretty sure I understand what you're trying to do. You'll need to do some reshaping of your data to stack the teams and runs so that they're in 2 long columns. Then you can perform a groupby operation to get the total number of runs per team.

away = df[["Away", "Away runs"]]
home = df[["Home", "Home runs"]]

new_df = pd.DataFrame(np.vstack([home, away]), columns=["team", "runs"])

print(new_df)
   team runs
0   FW1    2
1   WS3   18
2   RC1    4
3   CH1   14
4   TRO    5
5   CL1   10
6   CL1    8
7   FW1    5
8   FW1   12
9   BS1   14
10  CH1   14
11  CL1    0
12  BS1   20
13  CL1   12
14  CL1   12
15  BS1    9
16  CH1   18
17  WS3   12
18  CH1   14
19  WS3    6
20  TRO   29
21  WS3    4

Now that the teams are all in a single column, and their corresponding runs in the other. We can perform a simple groupby operation to calculate the total number of runs per team, ignoring whether they were "home" or "away"

team_runs = new_df.groupby("team", as_index=False)["runs"].sum()

print(team_runs)
  team  runs
0  BS1    43
1  CH1    60
2  CL1    42
3  FW1    19
4  RC1     4
5  TRO    34
6  WS3    40

Upvotes: 0

HotMailRob
HotMailRob

Reputation: 11

It depends on how the dataframe is structured,but if the run_aways are pandas series, you can simply add them. You could also do it in one line like :

runs = new_data.groupby( 'Home')['Away runs'].sum() + new_data.groupby( 'Home')['Home runs'].sum()

You could also use a mask and the .apply method but I find it les readable.

Upvotes: 1

Related Questions