Reputation: 7704
I have a 2D NumPy array that was created with:
array = dataset.to_numpy()
X = array[:, 1:]
I want to use OrdinalEncoder, but there are some NaNs in X that I want to impute. I can't run OrdinalEncoder because it doesn't accept the NaNs, and I can't run KNNImputer until I encode.
I know I can replace the NaNs with something like '?' and then OrdinalEncoder() will work, but then I have to go back and turn the numbers that '?' was mapped to into NaN again. That means looping through the OrdinalEncoder internals to figure out what '?' was mapped to in each column and then doing a replace on that column.
Isn't there a better way to do this? I was trying to get masking to work, but couldn't figure it out. I need to operate on X and not the dataset.
Upvotes: 1
Views: 942
Reputation: 46908
If you don't need the encoder anymore, you can use pd.factorize, which converts all np.nan to -1; you then need to put that back as np.nan:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
X = pd.DataFrame({'x1':['a','b',np.nan],'x2':[np.nan,'c','d']})  # np.NaN was removed in NumPy 2.0
    x1   x2
0    a  NaN
1    b    c
2  NaN    d
X.apply(lambda x:x.factorize()[0]).replace(-1,np.nan)
    x1   x2
0  0.0  NaN
1  1.0  0.0
2  NaN  1.0
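Since the question asks to operate on X rather than on the dataset, the same idea can be sketched directly on a NumPy array by wrapping it in a temporary DataFrame. The toy X below is made up for illustration and only stands in for the real array[:, 1:].

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X = array[:, 1:] (object array with missing entries).
X = np.array([["a", np.nan], ["b", "c"], [np.nan, "d"]], dtype=object)

# Factorize each column separately; missing values come back as the code -1.
codes = pd.DataFrame(X).apply(lambda col: col.factorize()[0])

# Turn the -1 sentinel back into NaN and return to a plain float array.
X_encoded = codes.replace(-1, np.nan).to_numpy(dtype=float)
print(X_encoded)
```

Note that factorize numbers the categories in order of first appearance, whereas OrdinalEncoder sorts them, so the integer codes of the two approaches can differ.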
Upvotes: 0
Reputation: 3260
Too long for a comment, but if you don't mind some copying you can simply shuffle the NaNs out of the array temporarily.
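import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OrdinalEncoder

array = dataset.to_numpy()
X = array[:, 1:]
nan_free_mask = ~pd.isna(X)  # np.isnan raises TypeError on object (string) arrays
X_encoded = np.full(X.shape, np.nan)
# A 2D boolean mask would flatten X to 1D, so encode one column at a time
for j in range(X.shape[1]):
    rows = nan_free_mask[:, j]
    X_encoded[rows, j] = OrdinalEncoder().fit_transform(X[rows, j].reshape(-1, 1)).ravel()
X_encoded = KNNImputer().fit_transform(X_encoded)  # pass n_neighbors etc. as needed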
There is also nothing wrong with your idea of replacing NaN with '?'. You simply need to remember where it happened. OrdinalEncoder keeps the rows and columns in their original order, so a mask computed beforehand still lines up afterwards:
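import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OrdinalEncoder

array = dataset.to_numpy()
X = array[:, 1:].copy()  # copy so the sentinel does not overwrite `array`
nan_mask = pd.isna(X)  # np.isnan raises TypeError on object (string) arrays
X[nan_mask] = '?'
X_encoded = OrdinalEncoder().fit_transform(X)  # needs an instance, not the class
X_encoded[nan_mask] = np.nan # restore NaN
X_encoded = KNNImputer().fit_transform(X_encoded)  # pass n_neighbors etc. as needed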
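For reference, here is a small end-to-end sketch of the '?' approach on made-up data. The toy X, the choice of '?' as a sentinel (assumed not to occur as a real category), and n_neighbors=2 are all assumptions for illustration, not part of the original question.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data standing in for X = dataset.to_numpy()[:, 1:]
X = np.array(
    [["a", "x"], ["b", np.nan], [np.nan, "y"], ["a", "x"], ["b", "y"]],
    dtype=object,
)

nan_mask = pd.isna(X)         # remember where the missing values are
X_filled = X.copy()
X_filled[nan_mask] = "?"      # sentinel; it gets its own code in each column

X_encoded = OrdinalEncoder().fit_transform(X_filled)
X_encoded[nan_mask] = np.nan  # overwrite the sentinel's code with NaN again

# KNNImputer fills each NaN with the mean of that feature over the nearest rows.
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X_encoded)
print(X_imputed)
```

Two things to keep in mind: the sentinel still occupies one code per column, so the real categories are not numbered contiguously from 0, and because KNNImputer averages neighbours, an imputed value may fall between two category codes and might need rounding before decoding.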
Then again, you may have thought of this already ... if so, please update the question and specify what you have tried.
Upvotes: 2