Pandas per group imputation of missing values

Question

How can I achieve such a per-country imputation for each indicator in pandas?

I want to impute the missing values per group

no-A-state should get np.min per indicatorKPI
no-ISO-state should get the np.mean per indicatorKPI
for states with missing values, I want to impute with the per indicatorKPI mean. Here, this would mean to impute the missing values for Serbia

mydf = pd.DataFrame({'Country':['no-A-state','no-ISO-state','germany','serbia', 'austria', 'germany','serbia', 'austria',], 'indicatorKPI':[np.nan,np.nan,'SP.DYN.LE00.IN','NY.GDP.MKTP.CD','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN'], 'value':[np.nan,np.nan,0.9,np.nan,0.7, 0.2, 0.3, 0.6]})

edit

The desired output should be similar to

mydf = pd.DataFrame({'Country':['no-A-state','no-ISO-state', 'no-A-state','no-ISO-state',
                                'germany','serbia','serbia', 'austria', 
                                'germany','serbia', 'austria',],
                   'indicatorKPI':['SP.DYN.LE00.IN','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN',
                                   'SP.DYN.LE00.IN','NY.GDP.MKTP.CD','SP.DYN.LE00.IN','NY.GDP.MKTP.CD','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN'],
                     'value':['MIN of all for this indicator', 'MEAN of all for this indicator','MIN of all for this indicator','MEAN of all for this indicator', 0.9,'MEAN of all for SP.DYN.LE00.IN indicator',0.7, 'MEAN of all for NY.GDP.MKTP.CD indicator',0.2, 0.3, 0.6]
                   })

EdChum · Accepted Answer

Based on your new example df the following works for me:

In [185]:
mydf.loc[mydf['Country'] == 'no-A-state', 'value'] = mydf['value'].min()
mydf.loc[mydf['Country'] == 'no-ISO-state', 'value'] = mydf['value'].mean()
mydf.loc[mydf['value'].isnull(), 'value'] = mydf['indicatorKPI'].map(mydf.groupby('indicatorKPI')['value'].mean())
mydf

Out[185]:
         Country    indicatorKPI     value
0     no-A-state  SP.DYN.LE00.IN  0.200000
1   no-ISO-state  NY.GDP.MKTP.CD  0.442857
2     no-A-state  SP.DYN.LE00.IN  0.200000
3   no-ISO-state  SP.DYN.LE00.IN  0.442857
4        germany  NY.GDP.MKTP.CD  0.900000
5         serbia  SP.DYN.LE00.IN  0.328571
6         serbia  NY.GDP.MKTP.CD  0.700000
7        austria  NY.GDP.MKTP.CD  0.585714
8        germany  SP.DYN.LE00.IN  0.200000
9         serbia  NY.GDP.MKTP.CD  0.300000
10       austria  SP.DYN.LE00.IN  0.600000

Basically what this does is to fill the missing values for each condition, so we set the min for the 'no-A-state' countries, then mean for 'no-ISO-state' countries. We then groupby on 'indicatorKPI' and calc the mean for each group and assign again to the null value rows, the respective countries' mean using map which performs a lookup

Here are the steps broken down:

In [187]:
mydf.groupby('indicatorKPI')['value'].mean()

Out[187]:

indicatorKPI
NY.GDP.MKTP.CD    0.633333
SP.DYN.LE00.IN    0.400000
Name: value, dtype: float64

In [188]:
mydf['indicatorKPI'].map(mydf.groupby('indicatorKPI')['value'].mean())

Out[188]:
0     0.400000
1     0.633333
2     0.400000
3     0.400000
4     0.633333
5     0.400000
6     0.633333
7     0.633333
8     0.400000
9     0.633333
10    0.400000
Name: indicatorKPI, dtype: float64

Pandas per group imputation of missing values

How can I achieve such a per-country imputation for each indicator in pandas?

edit

Answers (1)

Related Questions