Reputation: 3
I have a dataset with a lot of columns and I already have filtered the ones that I need:
import pandas as pd
data = pd.read_csv("./data/GL1871.txt", header=None, usecols=[3,6,9,10])
new_data = data.rename(columns={3: 'Away', 6: 'Home', 9: 'Away runs', 10: 'Home runs'})
What I want to get out of this dataframe is the sum of Column 'Away runs' and 'Home runs' for each team. The output should look something like this:
0 CL1 364
1 BS1 254
...
9 CH1 190
So far I have tried the groupby() method and the output is not what I really need:
runs_away = new_data.groupby( 'Home')['Away runs'].sum()
runs_away = new_data.groupby( 'Home')['Home runs'].sum()
Home
BS1 165
CH1 139
CL1 162
FW1 112
NY2 127
PH1 124
RC1 66
TRO 231
WS3 120
Name: Away runs, dtype: int64
Home
BS1 223
CH1 197
CL1 119
FW1 78
NY2 178
PH1 180
RC1 72
TRO 200
WS3 166
Name: Home runs, dtype: int64
Is there any smart way to do this and getting both values at the same time? Maybe a comprehension would be better, but I don't know how to iterate a Dataframe. Thank you in advance.
Also, the expected result is to get the sum of the runs in Away runs and Home runs for the same team. So sum Away runs where the Away is team x + Home runs where Home is team x as well
Adding sample from the dataset:
Away Home Away runs Home runs
0 CL1 FW1 0 2
1 BS1 WS3 20 18
2 CL1 RC1 12 4
3 CL1 CH1 12 14
4 BS1 TRO 9 5
Desired output:
0 CL1 364
1 BS1 254
...
9 CH1 190
#Where 364 is the sum of all runs of the team CL1 wether it was away or home
Away Home Away runs Home runs
0 CL1 FW1 0 2
1 BS1 WS3 20 18
2 CL1 RC1 12 4
3 CL1 CH1 12 14
4 BS1 TRO 9 5
5 CH1 CL1 18 10
6 WS3 CL1 12 8
7 CH1 FW1 14 5
8 WS3 FW1 6 12
9 TRO BS1 29 14
10 WS3 CH1 4 14
Upvotes: 0
Views: 117
Reputation: 13407
Pretty sure I understand what you're trying to do. You'll need to do some reshaping of your data to stack the teams and runs so that they're in 2 long columns. Then you can perform a groupby operation to get the total number of runs per team.
away = df[["Away", "Away runs"]]
home = df[["Home", "Home runs"]]
new_df = pd.DataFrame(np.vstack([home, away]), columns=["team", "runs"])
print(new_df)
team runs
0 FW1 2
1 WS3 18
2 RC1 4
3 CH1 14
4 TRO 5
5 CL1 10
6 CL1 8
7 FW1 5
8 FW1 12
9 BS1 14
10 CH1 14
11 CL1 0
12 BS1 20
13 CL1 12
14 CL1 12
15 BS1 9
16 CH1 18
17 WS3 12
18 CH1 14
19 WS3 6
20 TRO 29
21 WS3 4
Now that the teams are all in a single column, and their corresponding runs in the other. We can perform a simple groupby
operation to calculate the total number of runs per team, ignoring whether they were "home"
or "away"
team_runs = new_df.groupby("team", as_index=False)["runs"].sum()
print(team_runs)
team runs
0 BS1 43
1 CH1 60
2 CL1 42
3 FW1 19
4 RC1 4
5 TRO 34
6 WS3 40
Upvotes: 0
Reputation: 11
It depends on how the dataframe is structured,but if the run_aways are pandas series, you can simply add them. You could also do it in one line like :
runs = new_data.groupby( 'Home')['Away runs'].sum() + new_data.groupby( 'Home')['Home runs'].sum()
You could also use a mask and the .apply method but I find it les readable.
Upvotes: 1