Georg Heiler
Georg Heiler

Reputation: 17676

Pandas per group imputation of missing values

How can I achieve such a per-country imputation for each indicator in pandas?

I want to impute the missing values per group

edit

The desired output should be similar to

mydf = pd.DataFrame({'Country':['no-A-state','no-ISO-state', 'no-A-state','no-ISO-state',
                                'germany','serbia','serbia', 'austria', 
                                'germany','serbia', 'austria',],
                   'indicatorKPI':['SP.DYN.LE00.IN','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN',
                                   'SP.DYN.LE00.IN','NY.GDP.MKTP.CD','SP.DYN.LE00.IN','NY.GDP.MKTP.CD','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN'],
                     'value':['MIN of all for this indicator', 'MEAN of all for this indicator','MIN of all for this indicator','MEAN of all for this indicator', 0.9,'MEAN of all for SP.DYN.LE00.IN indicator',0.7, 'MEAN of all for NY.GDP.MKTP.CD indicator',0.2, 0.3, 0.6]
                   })

enter image description here

Upvotes: 3

Views: 2865

Answers (1)

EdChum
EdChum

Reputation: 394003

Based on your new example df the following works for me:

In [185]:
mydf.loc[mydf['Country'] == 'no-A-state', 'value'] = mydf['value'].min()
mydf.loc[mydf['Country'] == 'no-ISO-state', 'value'] = mydf['value'].mean()
mydf.loc[mydf['value'].isnull(), 'value'] = mydf['indicatorKPI'].map(mydf.groupby('indicatorKPI')['value'].mean())
mydf

Out[185]:
         Country    indicatorKPI     value
0     no-A-state  SP.DYN.LE00.IN  0.200000
1   no-ISO-state  NY.GDP.MKTP.CD  0.442857
2     no-A-state  SP.DYN.LE00.IN  0.200000
3   no-ISO-state  SP.DYN.LE00.IN  0.442857
4        germany  NY.GDP.MKTP.CD  0.900000
5         serbia  SP.DYN.LE00.IN  0.328571
6         serbia  NY.GDP.MKTP.CD  0.700000
7        austria  NY.GDP.MKTP.CD  0.585714
8        germany  SP.DYN.LE00.IN  0.200000
9         serbia  NY.GDP.MKTP.CD  0.300000
10       austria  SP.DYN.LE00.IN  0.600000

Basically what this does is to fill the missing values for each condition, so we set the min for the 'no-A-state' countries, then mean for 'no-ISO-state' countries. We then groupby on 'indicatorKPI' and calc the mean for each group and assign again to the null value rows, the respective countries' mean using map which performs a lookup

Here are the steps broken down:

In [187]:
mydf.groupby('indicatorKPI')['value'].mean()

Out[187]:

indicatorKPI
NY.GDP.MKTP.CD    0.633333
SP.DYN.LE00.IN    0.400000
Name: value, dtype: float64

In [188]:
mydf['indicatorKPI'].map(mydf.groupby('indicatorKPI')['value'].mean())

Out[188]:
0     0.400000
1     0.633333
2     0.400000
3     0.400000
4     0.633333
5     0.400000
6     0.633333
7     0.633333
8     0.400000
9     0.633333
10    0.400000
Name: indicatorKPI, dtype: float64

Upvotes: 3

Related Questions