Reputation: 17676
I want to impute the missing values per group
np.min
per indicatorKPI np.mean
per indicatorKPIfor states with missing values, I want to impute with the per indicatorKPI
mean. Here, this would mean to impute the missing values for Serbia
mydf = pd.DataFrame({'Country':['no-A-state','no-ISO-state','germany','serbia', 'austria', 'germany','serbia', 'austria',], 'indicatorKPI':[np.nan,np.nan,'SP.DYN.LE00.IN','NY.GDP.MKTP.CD','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN'], 'value':[np.nan,np.nan,0.9,np.nan,0.7, 0.2, 0.3, 0.6]})
The desired output should be similar to
mydf = pd.DataFrame({'Country':['no-A-state','no-ISO-state', 'no-A-state','no-ISO-state',
'germany','serbia','serbia', 'austria',
'germany','serbia', 'austria',],
'indicatorKPI':['SP.DYN.LE00.IN','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN',
'SP.DYN.LE00.IN','NY.GDP.MKTP.CD','SP.DYN.LE00.IN','NY.GDP.MKTP.CD','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN'],
'value':['MIN of all for this indicator', 'MEAN of all for this indicator','MIN of all for this indicator','MEAN of all for this indicator', 0.9,'MEAN of all for SP.DYN.LE00.IN indicator',0.7, 'MEAN of all for NY.GDP.MKTP.CD indicator',0.2, 0.3, 0.6]
})
Upvotes: 3
Views: 2865
Reputation: 394003
Based on your new example df the following works for me:
In [185]:
mydf.loc[mydf['Country'] == 'no-A-state', 'value'] = mydf['value'].min()
mydf.loc[mydf['Country'] == 'no-ISO-state', 'value'] = mydf['value'].mean()
mydf.loc[mydf['value'].isnull(), 'value'] = mydf['indicatorKPI'].map(mydf.groupby('indicatorKPI')['value'].mean())
mydf
Out[185]:
Country indicatorKPI value
0 no-A-state SP.DYN.LE00.IN 0.200000
1 no-ISO-state NY.GDP.MKTP.CD 0.442857
2 no-A-state SP.DYN.LE00.IN 0.200000
3 no-ISO-state SP.DYN.LE00.IN 0.442857
4 germany NY.GDP.MKTP.CD 0.900000
5 serbia SP.DYN.LE00.IN 0.328571
6 serbia NY.GDP.MKTP.CD 0.700000
7 austria NY.GDP.MKTP.CD 0.585714
8 germany SP.DYN.LE00.IN 0.200000
9 serbia NY.GDP.MKTP.CD 0.300000
10 austria SP.DYN.LE00.IN 0.600000
Basically what this does is to fill the missing values for each condition, so we set the min for the 'no-A-state' countries, then mean for 'no-ISO-state' countries. We then groupby on 'indicatorKPI' and calc the mean for each group and assign again to the null value rows, the respective countries' mean using map
which performs a lookup
Here are the steps broken down:
In [187]:
mydf.groupby('indicatorKPI')['value'].mean()
Out[187]:
indicatorKPI
NY.GDP.MKTP.CD 0.633333
SP.DYN.LE00.IN 0.400000
Name: value, dtype: float64
In [188]:
mydf['indicatorKPI'].map(mydf.groupby('indicatorKPI')['value'].mean())
Out[188]:
0 0.400000
1 0.633333
2 0.400000
3 0.400000
4 0.633333
5 0.400000
6 0.633333
7 0.633333
8 0.400000
9 0.633333
10 0.400000
Name: indicatorKPI, dtype: float64
Upvotes: 3