Calculations on a pandas DataFrame column conditional on another column

I have seen several 'set the value of a new column based on the value of another'-type questions, but as far as I can tell none of them address dividing values within the same column based on conditions set by another column.

The data I have is as the table below, minus the column (variable) 'healthpertotal'.

  1. It shows (in the column 'function'), the amount of government spending (aka expenditure) on
    a) health (column 'value'), and
    b) its total spending (same column 'value'), and
  2. the associated year of that spending (column 'year').

I want to make a new column that shows the percent of government health spending over its total spending, for a given year, as shown below in the column 'healthpertotal'.

So for instance, in 1995, the value of this variable is (42587(health spending amount)/326420(total spending amount))*100=13.05.
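
As a quick check of that arithmetic, the expected percentages for all three years can be reproduced with plain Python. This is only a sketch using the values from the table; the dictionary names are illustrative and not part of the data.

```python
# Values copied from the table in the question.
health = {1995: 42587, 1996: 44209, 1997: 44472}
total = {1995: 326420, 1996: 333637, 1997: 340252}

# Percent of total spending that went to health, per year, rounded to 2 places.
healthpertotal = {
    year: round(health[year] / total[year] * 100, 2) for year in health
}
print(healthpertotal)
```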

As for the rows showing total spending, the 'healthpertotal' could be 'missing', 1, or 'not applicable' and the like. I am ok with any of these options.

How would I set up this new column 'healthpertotal' using python?

The DataFrame I would like to end up with is built below. The final column 'healthpertotal' is filled in by hand here, only to show the target:

import numpy as np
import pandas as pd

data = {'function': ['Health'] * 3 + ['Total'] * 3,
        'year': [1995, 1996, 1997, 1995, 1996, 1997],
        'value': [42587, 44209, 44472, 326420, 333637, 340252],
        # filled in by hand to show the desired result
        'healthpertotal': [13.05, 13.25, 13.07] + [np.nan] * 3
        }

df = pd.DataFrame(data)

print(df)

Expected outcome:

  function  year   value  healthpertotal
0   Health  1995   42587           13.05
1   Health  1996   44209           13.25
2   Health  1997   44472           13.07
3    Total  1995  326420             NaN
4    Total  1996  333637             NaN
5    Total  1997  340252             NaN

Answers (1)

user7864386

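You could use groupby + transform('last') to broadcast each year's total onto every row of that year (this relies on the 'Total' row being the last row for each year, as it is in the sample data); then divide "value" by it using rdiv, multiply by 100, and replace 100 with NaN (assuming health spending is never 100% of the total):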

df['healthpertotal'] = df.groupby('year')['value'].transform('last').rdiv(df['value']).mul(100).replace(100, np.nan)
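
The one-liner can also be unpacked into separate steps. The following is a minimal sketch of a variant rather than the same method: it assumes each year has exactly one 'Total' row, looks the total up by year with `map` instead of relying on row order, and blanks the 'Total' rows with a boolean mask instead of replacing 100. The names `totals`, `pct` and `is_health` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'function': ['Health'] * 3 + ['Total'] * 3,
    'year': [1995, 1996, 1997, 1995, 1996, 1997],
    'value': [42587, 44209, 44472, 326420, 333637, 340252],
})

# Total spending per year, indexed by year (assumes one 'Total' row per year).
totals = df.loc[df['function'] == 'Total'].set_index('year')['value']

# Align each row with its year's total, then compute the percentage.
pct = df['value'] / df['year'].map(totals) * 100

# Keep the percentage only on 'Health' rows; all other rows become NaN.
is_health = df['function'] == 'Health'
df['healthpertotal'] = pct.where(is_health)

print(df.round(2))
```

Because the lookup is keyed on 'year', this version should not depend on how the rows are sorted.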

We could also use merge + concat (calculate the percentage in between these operations):

tmp = df.loc[df['function']=='Health'].merge(df.loc[df['function']=='Total'], on='year')
tmp['healthpertotal'] = tmp['value_x'] / tmp['value_y'] * 100
msk = tmp.columns.str.contains('_y')
tmp1 = tmp.loc[:, ~msk]
tmp2 = tmp[tmp.columns[msk].tolist() + ['year']]
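# strip the '_x' / '_y' suffixes, stack the two halves, and renumber the rows
out = pd.concat((tmp1.set_axis(tmp1.columns.map(lambda x: x.split('_')[0]), axis=1),
                 tmp2.set_axis(tmp2.columns.map(lambda x: x.split('_')[0]), axis=1)),
                ignore_index=True)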

We could also use merge + wide_to_long (calculate the percentage in between these operations) + mask the duplicates:

tmp = df.loc[df['function']=='Health'].merge(df.loc[df['function']=='Total'], on='year', suffixes=('0','1'))
tmp['healthpertotal'] = tmp['value0'] / tmp['value1'] * 100
out = pd.wide_to_long(tmp, stubnames=['function', 'value'], i=['year','healthpertotal'], j='').droplevel(-1).reset_index()
out['healthpertotal'] = out['healthpertotal'].mask(out['healthpertotal'].duplicated())

Output:

  function  year   value  healthpertotal
0   Health  1995   42587       13.046688
1   Health  1996   44209       13.250629
2   Health  1997   44472       13.070313
3    Total  1995  326420             NaN
4    Total  1996  333637             NaN
5    Total  1997  340252             NaN
