Reputation: 1127
I feel like what I'm trying to do is quite basic, but I can't seem to find a similar post here. Please let me know if my post is indeed a duplicate.
The data I have is about transportation crash incidents. The first two columns show the exact number of fatalities and injuries of the incident, but the 3rd and 4th columns (alcohol and cellphone related) only show binary values: 0 (meaning not related) and 1 (meaning related).
The example data is shown below:
     NAME  FATAL#  INJURY#  ALCOHOL  CELL
0  City A       5        1        0     0
1  City B       5        1        0     1
2  City A       3        1        1     0
3  City B       3        1        1     0
4  City A       3        0        1     0
5  City B       2        2        0     0
What I want is to do a groupby sum on the first two columns, and then, for each city, calculate the sum of FATAL# over the rows where the ALCOHOL or CELL column value is 1.
So basically my desired output is:
     NAME  FATAL#  INJURY#  ALCOHOL_FATAL  CELL_FATAL
0  City A      11        2              6           0
1  City B      10        4              3           5
I know for the first two columns I can do df.groupby('NAME')[['FATAL#', 'INJURY#']].sum(). As for the second part, I can do df.groupby(['NAME', 'ALCOHOL'])[['FATAL#', 'INJURY#']].sum(), but then I would lose the total count columns.
How shall I accomplish this?
Thanks.
Upvotes: 3
Views: 706
Reputation: 2811
You can use np.where to create the ALCOHOL_FATAL and CELL_FATAL columns.
import numpy as np
df['ALCOHOL_FATAL'] = np.where(df['ALCOHOL'] == 1, df['FATAL#'], 0)
df['CELL_FATAL'] = np.where(df['CELL'] == 1, df['FATAL#'], 0)
Then drop the ALCOHOL and CELL columns and groupby the data:
df.drop(['ALCOHOL', 'CELL'], axis = 1, inplace=True)
df.groupby('NAME', as_index=False).sum()
    NAME  FATAL#  INJURY#  ALCOHOL_FATAL  CELL_FATAL
0  CityA      11        2              6           0
1  CityB      10        4              3           5
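Put together, here is a self-contained sketch of this approach; the DataFrame construction is an assumption rebuilt from the question's example table:

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question's example table (assumed)
df = pd.DataFrame({'NAME': ['City A', 'City B', 'City A', 'City B', 'City A', 'City B'],
                   'FATAL#': [5, 5, 3, 3, 3, 2],
                   'INJURY#': [1, 1, 1, 1, 0, 2],
                   'ALCOHOL': [0, 0, 1, 1, 1, 0],
                   'CELL': [0, 1, 0, 0, 0, 0]})

# Keep the fatality count only on rows where the flag is set, else 0
df['ALCOHOL_FATAL'] = np.where(df['ALCOHOL'] == 1, df['FATAL#'], 0)
df['CELL_FATAL'] = np.where(df['CELL'] == 1, df['FATAL#'], 0)

# Drop the flag columns, then sum every remaining column per city
df = df.drop(['ALCOHOL', 'CELL'], axis=1)
res = df.groupby('NAME', as_index=False).sum()
print(res)
```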
Upvotes: 0
Reputation: 1604
In a single groupby:
df.groupby('NAME').apply(
    lambda grp: pd.Series({col: grp.loc[grp[col] > 0, 'FATAL#'].sum()
                           if col in ('ALCOHOL', 'CELL') else grp[col].sum()
                           for col in ['FATAL#', 'INJURY#', 'ALCOHOL', 'CELL']}))
      FATAL#  INJURY#  ALCOHOL  CELL
NAME
A         11        2        6     0
B         10        4        3     5
Upvotes: 0
Reputation: 164613
Sometimes it is clearest to join additional series to your dataframe, then groupby:
df = pd.DataFrame({'NAME': ['CityA', 'CityB', 'CityA', 'CityB', 'CityA', 'CityB'],
'FATAL#': [5, 5, 3, 3, 3, 2],
'INJURY#': [1, 1, 1, 1, 0, 2],
'ALCOHOL': [0, 0, 1, 1, 1, 0],
'CELL': [0, 1, 0, 0, 0, 0]})
# construct fatals dataframe and join
fatals = df.iloc[:, -2:].mul(df['FATAL#'], axis=0).add_prefix('FATAL_')
df = df.join(fatals)
# define columns to sum and groupby
sum_cols = ['FATAL#', 'INJURY#'] + df.columns[-2:].tolist()
res = df.groupby('NAME')[sum_cols].sum().reset_index()
print(res)
    NAME  FATAL#  INJURY#  FATAL_ALCOHOL  FATAL_CELL
0  CityA      11        2              6           0
1  CityB      10        4              3           5
Upvotes: 2