Bowen Liu

Reputation: 1127

How to groupby and sum if the cell value of certain columns fit specific conditions

I feel like what I'm trying to do is quite basic, but I can't seem to find a similar post here. Please let me know if my post is indeed a duplicate.

The data I have is about transportation crash incidents. The first two columns show the exact numbers of fatalities and injuries for each incident, while the 3rd and 4th columns (alcohol- and cellphone-related) contain only binary values: 0 (not related) and 1 (related).

The example data is shown below:

       NAME  FATAL#  INJURY#  ALCOHOL  CELL
    0  City A     5        1        0     0
    1  City B     5        1        0     1
    2  City A     3        1        1     0
    3  City B     3        1        1     0
    4  City A     3        0        1     0
    5  City B     2        2        0     0

What I want is a groupby sum on the first two columns, and then, for each city, the sum of FATAL# over only the rows where the ALCOHOL or CELL column is 1.

So basically my desired output is:

       NAME  FATAL#  INJURY#  ALCOHOL_FATAL  CELL_FATAL
    0  City A    11        2              6           0
    1  City B    10        4              3           5

I know for the first two columns I can do df.groupby('NAME')[['FATAL#', 'INJURY#']].sum(). As for the second part, I can do df.groupby(['NAME', 'ALCOHOL'])[['FATAL#', 'INJURY#']].sum(), but then I lose the total count columns.
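For reference, a runnable version of those two partial attempts (with the stray dot before the column selector removed, and the column names taken from the example data above):

```python
import pandas as pd

df = pd.DataFrame({'NAME': ['City A', 'City B', 'City A', 'City B', 'City A', 'City B'],
                   'FATAL#': [5, 5, 3, 3, 3, 2],
                   'INJURY#': [1, 1, 1, 1, 0, 2],
                   'ALCOHOL': [0, 0, 1, 1, 1, 0],
                   'CELL': [0, 1, 0, 0, 0, 0]})

# totals per city: note there is no dot between groupby(...) and the column list
totals = df.groupby('NAME')[['FATAL#', 'INJURY#']].sum()

# splitting by the ALCOHOL flag gives the conditional sums,
# but the overall totals are no longer in the same frame
by_alcohol = df.groupby(['NAME', 'ALCOHOL'])[['FATAL#', 'INJURY#']].sum()
```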

How shall I accomplish this?

Thanks.

Upvotes: 3

Views: 706

Answers (3)

Terry

Reputation: 2811

You can use np.where to create ALCOHOL_FATAL and CELL_FATAL columns.

import numpy as np

df['ALCOHOL_FATAL'] = np.where(df['ALCOHOL'] == 1, df['FATAL#'], 0)
df['CELL_FATAL'] = np.where(df['CELL'] == 1, df['FATAL#'], 0)

Then drop the ALCOHOL and CELL columns and group the data:

df.drop(['ALCOHOL', 'CELL'], axis=1, inplace=True)
df.groupby('NAME', as_index=False).sum()

       NAME  FATAL#  INJURY#  ALCOHOL_FATAL  CELL_FATAL
    0  City A    11        2              6           0
    1  City B    10        4              3           5

Upvotes: 0

Brian

Reputation: 1604

In a single groupby, applying one function per group:

df.groupby('NAME').apply(
    lambda grp: pd.Series({
        'FATAL#': grp['FATAL#'].sum(),
        'INJURY#': grp['INJURY#'].sum(),
        'ALCOHOL': grp.loc[grp['ALCOHOL'] > 0, 'FATAL#'].sum(),
        'CELL': grp.loc[grp['CELL'] > 0, 'FATAL#'].sum(),
    }))


          FATAL#  INJURY#  ALCOHOL  CELL
    NAME
    A         11        2        6     0
    B         10        4        3     5

Upvotes: 0

jpp

Reputation: 164613

Sometimes it is clearest to join additional series to your dataframe, then groupby:

import pandas as pd

df = pd.DataFrame({'NAME': ['CityA', 'CityB', 'CityA', 'CityB', 'CityA', 'CityB'],
                   'FATAL#': [5, 5, 3, 3, 3, 2],
                   'INJURY#': [1, 1, 1, 1, 0, 2],
                   'ALCOHOL': [0, 0, 1, 1, 1, 0],
                   'CELL': [0, 1, 0, 0, 0, 0]})

# construct fatals dataframe and join
fatals = df.iloc[:, -2:].mul(df['FATAL#'], axis=0).add_prefix('FATAL_')
df = df.join(fatals)

# define columns to sum and groupby
sum_cols = ['FATAL#', 'INJURY#'] + df.columns[-2:].tolist()
res = df.groupby('NAME')[sum_cols].sum().reset_index()

print(res)

    NAME  FATAL#  INJURY#  FATAL_ALCOHOL  FATAL_CELL
0  CityA      11        2              6           0
1  CityB      10        4              3           5

Upvotes: 2
