Md. Rezwanul Haque
Md. Rezwanul Haque

Reputation: 2950

How to handle missing data with respect to type of dataset?

I have a dataset where has column types that has type like Primary , Secondary.

df

   ID    types        C   D
0  101   Primary      2   3
1  103   Primary      6   3
2  108   Primary     10   ?
3  109   Primary      3  12
4  118   Secondary    5   2
5  122   Secondary    ?   6
6  123   Secondary    5   6
7  125   Secondary    2   5 

I want to replace missing value with median for each type. Such as-

result_df

   ID    types        C   D
0  101   Primary      2   3
1  103   Primary      6   3
2  108   Primary     10   3
3  109   Primary      3  12
4  118   Secondary    5   2
5  122   Secondary    5   6
6  123   Secondary    5   6
7  125   Secondary    2   5 

How can do it with Python?

Upvotes: 0

Views: 699

Answers (2)

Venkatachalam
Venkatachalam

Reputation: 16966

As mentioned by @Mayank Porwal, first convert the missing values into np.nan and then you can apply imputation with sklearn impute methods.

simpleImputer

import numpy as np
import pandas as pd

df.replace('?',np.nan,inplace=True)

from sklearn.impute import SimpleImputer

for types,group in df.groupby('types'):
    imp = SimpleImputer(missing_values=np.nan, strategy='median')
    df.loc[df['types']==types,['C','D']] = imp.fit_transform(group[['C','D']])

Upvotes: 1

Mayank Porwal
Mayank Porwal

Reputation: 34046

Something like this should work:

First replace ? in your df with actual np.nan values:

In [1268]: df = df.replace('?',np.nan)
In [1273]: df
Out[1273]: 
    ID      types    C    D
0  101    Primary    2    3
1  103    Primary    6    3
2  108    Primary   10  NaN
3  109    Primary    3   12
4  118  Secondary    5    2
5  122  Secondary  NaN    6
6  123  Secondary    5    6
7  125  Secondary    2    5

For me the dtypes is showing as object for columns C and D. Hence, I convert these into numeric before finding median. If this is not applicable for you, skip this step and directly run the below command with transform function.

In [1274]: df.dtypes
Out[1274]: 
ID        int64
types    object
C        object
D        object
dtype: object

In order to find median, convert columns C and D into pandas numeric type:

In [1256]: df.C = df.C.apply(pd.to_numeric)
In [1258]: df.D = df.D.apply(pd.to_numeric)

In [1279]: df.dtypes
Out[1279]: 
ID         int64
types     object
C        float64
D        float64
dtype: object

Now, you can fill nulls with median of types in both columns C and D like below, using groupby and transform functions:

In [1265]: df.C = df.C.fillna(df.groupby('types')['C'].transform('median'))

In [1266]: df.D = df.D.fillna(df.groupby('types')['D'].transform('median'))

In [1267]: df
Out[1267]: 
    ID      types     C     D
0  101    Primary   2.0   3.0
1  103    Primary   6.0   3.0
2  108    Primary  10.0   3.0
3  109    Primary   3.0  12.0
4  118  Secondary   5.0   2.0
5  122  Secondary   5.0   6.0
6  123  Secondary   5.0   6.0
7  125  Secondary   2.0   5.0

Let me know if this helps.

Upvotes: 2

Related Questions