Roberto Araya

Reputation: 21

Can I use sklearn's IterativeImputer to fill in missing categorical data?

I have a dataset of categorical and continuous features, and many of them have missing elements. I was wondering if I can use IterativeImputer to fill in continuous as well as categorical data.

And if it can't be done, what would be the best way to proceed? Would it be best to separate the data into continuous and discrete features, use, for example, IterativeImputer for the first set and KNN for the second, and then merge them?
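Something like the following sketch is what I have in mind (the column names and toy values here are just placeholders, not my real data):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (IterativeImputer is experimental)
from sklearn.impute import IterativeImputer, KNNImputer

# toy frame: two continuous columns and two binary categorical columns
df = pd.DataFrame({
    "age":      [-1.1, 0.2, np.nan, 1.0],
    "chol":     [0.8, np.nan, -0.2, 0.3],
    "painloc":  [1.0, 0.0, np.nan, 1.0],
    "painexer": [1.0, 0.0, 0.0, 1.0],
})
cont_cols = ["age", "chol"]
cat_cols = ["painloc", "painexer"]

# impute each block separately...
cont_imp = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(df[cont_cols]),
    columns=cont_cols, index=df.index,
)
cat_imp = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df[cat_cols]),
    columns=cat_cols, index=df.index,
)
# ...round the KNN output so the categorical columns stay in {0, 1}...
cat_imp = cat_imp.round()

# ...and merge the two blocks back together
merged = pd.concat([cont_imp, cat_imp], axis=1)
```

Is this split-and-merge approach reasonable, or is there a cleaner way?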

Any help would be appreciated.

The data consists of 65 features:

x_train

        age         sex painloc painexer relrest    cp   trestbps      htn     chol      smoke      ...     om1     om2 rcaprox rcadist     lvx1    lvx2    lvx3    lvx4    lvf     cathef
288     -1.109572   1.0     0.0     0.0     0.0     1.0     -0.655059   0.0     0.818661    NaN     ...     NaN     NaN     NaN     NaN     1.0     1.0     1.0     1.0     2.0     0.568676
283     -0.180525   1.0     1.0     0.0     0.0     2.0     1.447445    0.0     -0.040919   NaN     ...     NaN     NaN     NaN     NaN     1.0     1.0     1.0     1.0     1.0     NaN
230     -0.077297   1.0     1.0     1.0     0.0     3.0     0.659006    1.0     2.872604    NaN     ...     2.0     NaN     2.0     NaN     1.0     1.0     1.0     1.0     1.0     NaN
380     -0.799890   0.0     1.0     1.0     1.0     4.0     -0.129433   0.0     0.339106    NaN     ...     NaN     NaN     NaN     NaN     1.0     1.0     1.0     1.0     1.0     NaN
147     0.129157    1.0     1.0     1.0     1.0     4.0     NaN     0.0     0.031467    0.0     ...     1.0     1.0     1.0     1.0     1.0     1.0     1.0     1.0     1.0     -0.822164
...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
847     -0.180525   0.0     NaN     NaN     NaN     3.0     0.185942    1.0     -0.040919   NaN     ...     1.0     NaN     1.0     1.0     1.0     1.0     1.0     1.0     1.0     NaN
301     -0.283752   1.0     1.0     1.0     1.0     4.0     -0.129433   0.0     -0.194738   NaN     ...     NaN     NaN     NaN     NaN     1.0     1.0     1.0     1.0     1.0     NaN
693     0.645295    1.0     NaN     NaN     NaN     4.0     -0.392246   1.0     0.520070    NaN     ...     1.0     NaN     2.0     1.0     1.0     1.0     1.0     1.0     1.0     NaN
115     1.058204    1.0     1.0     1.0     1.0     4.0     NaN     0.0     0.954384    0.0     ...     1.0     1.0     2.0     1.0     1.0     1.0     1.0     1.0     1.0     -0.811925
155     1.574341    1.0     1.0     1.0     1.0     4.0     NaN     1.0     NaN     0.0     ...     1.0     1.0     1.0     1.0     1.0     1.0     1.0     1.0     1.0     NaN

I have standardized the continuous variables. There are many categorical features like 'painloc' and 'painexer' that have missing values, and there are also continuous ones like 'age' (I decided to treat it as continuous) and 'chol' that also have missing elements.

I tried using IterativeImputer:

import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # required: IterativeImputer is still experimental
from sklearn.impute import IterativeImputer

x_mice = x_train.copy()
mice_impute = IterativeImputer(sample_posterior=True)
x_mice = pd.DataFrame(mice_impute.fit_transform(x_mice))
x_mice.columns = labels  # labels holds the original column names
x_mice

     age    sex     painloc     painexer    relrest     cp  trestbps    htn     chol    smoke   ...     om1     om2     rcaprox     rcadist     lvx1    lvx2    lvx3    lvx4    lvf     cathef
0   1.049449    1.0     1.000000    1.000000    1.000000    4.0     0.444874    0.000000    0.540723    0.000000    ...     1.000000    1.000000    2.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    -0.891887
1   0.505617    1.0     1.000000    1.000000    0.000000    2.0     -0.266785   0.000000    -1.752150   0.000000    ...     1.000000    1.000000    2.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    -0.888760
2   0.831916    1.0     1.000000    0.000000    0.000000    4.0     -1.080109   0.764037    -1.752150   1.450166    ...     1.000000    1.000000    1.000000    1.000000    1.025761    0.879404    -0.400332   3.193691    3.267492    1.118696
3   -0.582047   1.0     1.000000    0.000000    0.000000    2.0     -1.588436   0.000000    -0.249794   0.000000    ...     1.383778    1.048614    -0.147575   1.942328    1.000000    1.000000    1.000000    1.000000    1.000000    0.802084
4   -1.452178   1.0     1.000000    0.000000    0.000000    3.0     0.444874    1.000000    5.232542    1.000000    ...     1.235595    1.249215    2.269437    1.155985    1.000000    1.000000    1.000000    1.000000    1.000000    -1.935223
...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
624     0.179318    1.0     1.000000    1.000000    1.000000    4.0     -0.571781   0.000000    0.628910    -0.060307   ...     0.928614    0.830982    1.080936    1.185430    1.000000    1.000000    1.000000    1.000000    1.000000    -1.032691
625     1.702047    1.0     1.000000    0.000000    1.000000    3.0     0.444874    0.000000    -1.752150   0.000000    ...     1.000000    1.000000    2.000000    1.000000    1.000000    1.000000    1.000000    1.000000    2.000000    -0.895014
626     -0.364514   1.0     0.694690    1.738101    0.396025    4.0     0.953201    1.000000    0.390804    1.287500    ...     1.000000    0.739708    2.000000    1.000000    1.000000    1.000000    1.000000    1.000000    2.000000    -0.523902
627     0.723149    1.0     0.762459    0.038032    0.315826    4.0     0.444874    1.000000    0.831741    0.750375    ...     1.000000    0.912221    2.000000    1.000000    1.000000    1.000000    1.000000    1.000000    2.000000    0.730936
628     0.940682    1.0     1.000000    1.000000    1.000000    4.0     -0.000217   0.000000    -0.252964   0.000000    ...     1.000000    1.000000    2.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    -0.888134

It works fine for continuous features but not for categorical ones, as it can fill in a decimal number, and that's obviously not right.
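The only workaround I've come up with so far is to snap the imputed categorical columns back to the nearest observed category after the fact, something like this (toy data and hypothetical column names, not my real frame):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (IterativeImputer is experimental)
from sklearn.impute import IterativeImputer

# toy frame: 'age' continuous, 'painloc' a binary categorical column
df = pd.DataFrame({
    "age":     [-1.1, 0.2, np.nan, 1.0, 0.5],
    "painloc": [0.0, np.nan, 1.0, 1.0, 0.0],
})
cat_cols = ["painloc"]

imputed = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(df),
    columns=df.columns, index=df.index,
)

# snap each imputed categorical value to the nearest category observed in the data
for col in cat_cols:
    valid = np.sort(df[col].dropna().unique())
    nearest = np.abs(imputed[col].to_numpy()[:, None] - valid[None, :]).argmin(axis=1)
    imputed[col] = valid[nearest]
```

But I'm not sure this is statistically sound, which is why I'm asking.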

Upvotes: 2

Views: 1408

Answers (0)
