I have a dataset of categorical and continuous features, and many of them have missing values. I was wondering whether I can use the same imputer to fill in both the continuous and the categorical data.
If that can't be done, what would be the best way to proceed? Would it be best to split the data into continuous and discrete features, use, for example, IterativeImputer for the first set and KNN for the second, and then merge them?
Any help would be appreciated.
The data consists of 65 features:
x_train
age sex painloc painexer relrest cp trestbps htn chol smoke ... om1 om2 rcaprox rcadist lvx1 lvx2 lvx3 lvx4 lvf cathef
288 -1.109572 1.0 0.0 0.0 0.0 1.0 -0.655059 0.0 0.818661 NaN ... NaN NaN NaN NaN 1.0 1.0 1.0 1.0 2.0 0.568676
283 -0.180525 1.0 1.0 0.0 0.0 2.0 1.447445 0.0 -0.040919 NaN ... NaN NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 NaN
230 -0.077297 1.0 1.0 1.0 0.0 3.0 0.659006 1.0 2.872604 NaN ... 2.0 NaN 2.0 NaN 1.0 1.0 1.0 1.0 1.0 NaN
380 -0.799890 0.0 1.0 1.0 1.0 4.0 -0.129433 0.0 0.339106 NaN ... NaN NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 NaN
147 0.129157 1.0 1.0 1.0 1.0 4.0 NaN 0.0 0.031467 0.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 -0.822164
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
847 -0.180525 0.0 NaN NaN NaN 3.0 0.185942 1.0 -0.040919 NaN ... 1.0 NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN
301 -0.283752 1.0 1.0 1.0 1.0 4.0 -0.129433 0.0 -0.194738 NaN ... NaN NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 NaN
693 0.645295 1.0 NaN NaN NaN 4.0 -0.392246 1.0 0.520070 NaN ... 1.0 NaN 2.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN
115 1.058204 1.0 1.0 1.0 1.0 4.0 NaN 0.0 0.954384 0.0 ... 1.0 1.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 -0.811925
155 1.574341 1.0 1.0 1.0 1.0 4.0 NaN 1.0 NaN 0.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN
I have standardized the continuous variables. Many categorical features, such as 'painloc' and 'painexer', have missing values, and so do continuous ones such as 'age' (which I decided to treat as continuous) and 'chol'.
I tried using IterativeImputer:
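import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must be imported before IterativeImputer)
from sklearn.impute import IterativeImputer

# Impute every column with the same model, then restore the column names
mice_impute = IterativeImputer(sample_posterior=True)
x_mice = pd.DataFrame(mice_impute.fit_transform(x_train), columns=x_train.columns)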
x_mice
age sex painloc painexer relrest cp trestbps htn chol smoke ... om1 om2 rcaprox rcadist lvx1 lvx2 lvx3 lvx4 lvf cathef
0 1.049449 1.0 1.000000 1.000000 1.000000 4.0 0.444874 0.000000 0.540723 0.000000 ... 1.000000 1.000000 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 -0.891887
1 0.505617 1.0 1.000000 1.000000 0.000000 2.0 -0.266785 0.000000 -1.752150 0.000000 ... 1.000000 1.000000 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 -0.888760
2 0.831916 1.0 1.000000 0.000000 0.000000 4.0 -1.080109 0.764037 -1.752150 1.450166 ... 1.000000 1.000000 1.000000 1.000000 1.025761 0.879404 -0.400332 3.193691 3.267492 1.118696
3 -0.582047 1.0 1.000000 0.000000 0.000000 2.0 -1.588436 0.000000 -0.249794 0.000000 ... 1.383778 1.048614 -0.147575 1.942328 1.000000 1.000000 1.000000 1.000000 1.000000 0.802084
4 -1.452178 1.0 1.000000 0.000000 0.000000 3.0 0.444874 1.000000 5.232542 1.000000 ... 1.235595 1.249215 2.269437 1.155985 1.000000 1.000000 1.000000 1.000000 1.000000 -1.935223
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
624 0.179318 1.0 1.000000 1.000000 1.000000 4.0 -0.571781 0.000000 0.628910 -0.060307 ... 0.928614 0.830982 1.080936 1.185430 1.000000 1.000000 1.000000 1.000000 1.000000 -1.032691
625 1.702047 1.0 1.000000 0.000000 1.000000 3.0 0.444874 0.000000 -1.752150 0.000000 ... 1.000000 1.000000 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000 -0.895014
626 -0.364514 1.0 0.694690 1.738101 0.396025 4.0 0.953201 1.000000 0.390804 1.287500 ... 1.000000 0.739708 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000 -0.523902
627 0.723149 1.0 0.762459 0.038032 0.315826 4.0 0.444874 1.000000 0.831741 0.750375 ... 1.000000 0.912221 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000 0.730936
628 0.940682 1.0 1.000000 1.000000 1.000000 4.0 -0.000217 0.000000 -0.252964 0.000000 ... 1.000000 1.000000 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 -0.888134
It works fine for continuous features but not for categorical ones: it can fill in a decimal number (for example 0.694690 for 'painloc'), which is obviously not a valid category.
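To make the split approach I describe above concrete, here is a minimal sketch on a toy frame. The toy data and the continuous/categorical column lists are assumptions for illustration only, not my real 65 features. I have also swapped KNN for SimpleImputer with strategy="most_frequent" for the categorical part, because KNNImputer averages the neighbours' values and so could also produce non-integer results.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Toy stand-in for x_train (hypothetical values, four columns only).
toy = pd.DataFrame({
    "age": [-1.1, -0.2, np.nan, 0.1, 1.6, 0.6],
    "chol": [0.8, np.nan, 2.9, 0.3, 0.0, 0.5],
    "painloc": [0.0, 1.0, 1.0, np.nan, 1.0, 1.0],
    "cp": [1.0, 2.0, np.nan, 4.0, 4.0, 4.0],
})

# Assumed split; in the real data these lists would be built by hand.
continuous_cols = ["age", "chol"]
categorical_cols = ["painloc", "cp"]

# Each imputer only sees its own group of columns.
imputer = ColumnTransformer([
    ("cont", IterativeImputer(random_state=0), continuous_cols),
    ("cat", SimpleImputer(strategy="most_frequent"), categorical_cols),
])

# The output columns follow the order of the transformers above.
imputed = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=continuous_cols + categorical_cols,
    index=toy.index,
)
print(imputed)
```

With this setup the categorical columns can only receive values that already occur in them, but the categorical imputer ignores the continuous features entirely, which is part of why I am unsure this is the best way to go.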