I have seen several 'set value of new column based on value of another'-type questions, but none that I found address dividing values within the same column based on conditions set by another column.
The data I have is as the table below, minus the column (variable) 'healthpertotal'.
I want to make a new column that shows the percent of government health spending over its total spending, for a given year, as shown below in the column 'healthpertotal'.
So for instance, in 1995, the value of this variable is (42587(health spending amount)/326420(total spending amount))*100=13.05.
As for the rows showing total spending, the 'healthpertotal' could be 'missing', 1, or 'not applicable' and the like. I am ok with any of these options.
How would I set up this new column 'healthpertotal' using python?
The DataFrame I would like to end up with is built below. The final variable 'healthpertotal' is filled in by hand here, purely to show the target result:
import numpy as np
import pandas as pd

data = {'function': ['Health'] * 3 + ['Total'] * 3,
        'year': [1995, 1996, 1997, 1995, 1996, 1997],
        'value': [42587, 44209, 44472, 326420, 333637, 340252],
        # Target values, entered manually for illustration
        'healthpertotal': [13.05, 13.25, 13.07] + [np.nan] * 3
        }
df = pd.DataFrame(data)
print(df)
Expected outcome:
function year value healthpertotal
0 Health 1995 42587 13.05
1 Health 1996 44209 13.25
2 Health 1997 44472 13.07
3 Total 1995 326420 NaN
4 Total 1996 333637 NaN
5 Total 1997 340252 NaN
Upvotes: 0
Views: 292
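You could use groupby + transform('last') to broadcast each year's total value across all rows of that year (this relies on the 'Total' row being the last row within each year, as it is in the sample data); then divide "value" by it using rdiv; then replace 100 with NaN (assuming health spending is never 100% of the total):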
df['healthpertotal'] = df.groupby('year')['value'].transform('last').rdiv(df['value']).mul(100).replace(100, np.nan)
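The one-liner above depends on row order and on no health share being exactly 100. As a minimal sketch that avoids both assumptions, each year's total can be looked up explicitly with Series.map and the result kept only on the Health rows with where. The names totals and is_health are illustrative and not part of the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'function': ['Health'] * 3 + ['Total'] * 3,
    'year': [1995, 1996, 1997, 1995, 1996, 1997],
    'value': [42587, 44209, 44472, 326420, 333637, 340252],
})

# Year -> total spending, taken from the rows labelled 'Total'
# (assumes exactly one 'Total' row per year).
totals = df.loc[df['function'] == 'Total'].set_index('year')['value']

# Align every row with its year's total, compute the percentage,
# and keep it only on the Health rows (all other rows become NaN).
is_health = df['function'] == 'Health'
df['healthpertotal'] = (df['value'] / df['year'].map(totals) * 100).where(is_health)
print(df)
```

This gives the same column as the groupby version on the sample data, regardless of how the rows are sorted.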
We could also use merge + concat (calculating the percentage in between these operations):
tmp = df.loc[df['function']=='Health'].merge(df.loc[df['function']=='Total'], on='year')
tmp['healthpertotal'] = tmp['value_x'] / tmp['value_y'] * 100
msk = tmp.columns.str.contains('_y')
tmp1 = tmp.loc[:, ~msk]
tmp2 = tmp[tmp.columns[msk].tolist() + ['year']]
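# Strip the '_x'/'_y' suffixes, stack the Health rows on top of the Total rows,
# and renumber the index so it runs 0..5
out = pd.concat((tmp1.set_axis(tmp1.columns.map(lambda x: x.split('_')[0]), axis=1),
                 tmp2.set_axis(tmp2.columns.map(lambda x: x.split('_')[0]), axis=1)),
                ignore_index=True)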
We could also use merge + wide_to_long (calculating the percentage in between these operations), then mask the duplicates:
tmp = df.loc[df['function']=='Health'].merge(df.loc[df['function']=='Total'], on='year', suffixes=('0','1'))
tmp['healthpertotal'] = tmp['value0'] / tmp['value1'] * 100
out = pd.wide_to_long(tmp, stubnames=['function', 'value'], i=['year','healthpertotal'], j='').droplevel(-1).reset_index()
out['healthpertotal'] = out['healthpertotal'].mask(out['healthpertotal'].duplicated())
Output:
function year value healthpertotal
0 Health 1995 42587 13.046688
1 Health 1996 44209 13.250629
2 Health 1997 44472 13.070313
3 Total 1995 326420 NaN
4 Total 1996 333637 NaN
5 Total 1997 340252 NaN
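The expected outcome in the question shows two decimals, while the output above is unrounded. If that presentation matters, a small sketch using Series.round on the computed column (shown here on a hypothetical two-row frame named out, built only for illustration):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the computed result: one Health row and one Total row.
out = pd.DataFrame({
    'function': ['Health', 'Total'],
    'year': [1995, 1995],
    'value': [42587, 326420],
    'healthpertotal': [42587 / 326420 * 100, np.nan],
})

# Round to two decimals; NaN entries are left untouched.
out['healthpertotal'] = out['healthpertotal'].round(2)
print(out)
```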
Upvotes: 1