Reputation: 1223
I have a feature matrix with missing values (NaNs), so I need to impute those missing values first. However, the last line fails with the following error:
Expected sequence or array-like, got Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
I checked, and the reason seems to be that train_fea_imputed is not a NumPy array but a sklearn.preprocessing.imputation.Imputer object. How should I fix this?
BTW, if I use train_fea_imputed = imp.fit_transform(train_fea), the code works fine, but train_fea_imputed comes back as an array with one dimension fewer than train_fea.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
train_fea_imputed = imp.fit(train_fea)
# train_fea_imputed = imp.fit_transform(train_fea)
rf = RandomForestClassifier(n_estimators=5000,n_jobs=1, min_samples_leaf = 3)
rf.fit(train_fea_imputed, train_label)
Update: I changed the imputer to
imp = Imputer(missing_values='NaN', strategy='mean', axis=1)
and now the dimension problem no longer occurs. I think there are some inherent issues in the imputing function. I will come back when I finish the project.
Upvotes: 1
Views: 4978
Reputation: 11
I think that axis=1 is not correct in this case, since you want to take the mean over the values of each feature column (axis=0) and not over each row (axis=1).
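To make the difference concrete, here is a minimal NumPy sketch of what the two axis choices compute. The matrix X and the variable names are made up for illustration and are not from the question. It fills the same NaN once with its column mean and once with its row mean.

```python
import numpy as np

# Hypothetical toy matrix: rows are samples, columns are features
# measured on very different scales.
X = np.array([[1.0, 2.0, 10.0],
              [3.0, np.nan, 30.0],
              [5.0, 6.0, 50.0]])

# axis=0 collapses the rows: one mean per feature column, ignoring NaNs.
col_means = np.nanmean(X, axis=0)

# axis=1 collapses the columns: one mean per sample row, ignoring NaNs.
row_means = np.nanmean(X, axis=1)

# Fill each NaN with the mean of its own column (feature-wise imputation).
X_by_col = np.where(np.isnan(X), col_means, X)

# Fill each NaN with the mean of its own row (sample-wise imputation).
X_by_row = np.where(np.isnan(X), row_means[:, np.newaxis], X)

print(X_by_col[1, 1])  # mean of 2.0 and 6.0
print(X_by_row[1, 1])  # mean of 3.0 and 30.0
```

The row-wise fill mixes values from unrelated features, which is usually not what you want when the columns have different meanings or scales.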
Upvotes: 1
Reputation: 36545
With scikit-learn, initialising the model, training the model and getting the predictions are separate steps. In your case you have:
train_fea = np.array([[1,1,0],[0,0,1],[1,np.nan,0]])
train_fea
array([[ 1.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 1., nan,  0.]])
#initialise the model
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
#train the model
imp.fit(train_fea)
#get the predictions
train_fea_imputed = imp.transform(train_fea)
train_fea_imputed
array([[ 1. ,  1. ,  0. ],
       [ 0. ,  0. ,  1. ],
       [ 1. ,  0.5,  0. ]])
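For readers on a recent scikit-learn: Imputer was deprecated in 0.20 and removed in 0.22 in favour of sklearn.impute.SimpleImputer, which always imputes column-wise and has no axis parameter. The sketch below assumes such a version is installed. The variable names returns_self, all_nan_col and dropped are made up for illustration. It shows that fit() returns the imputer itself, which is what triggered the original error. It also shows that a column containing only NaNs is dropped by default, which may be why fit_transform returned a smaller array in the question.

```python
import numpy as np
from sklearn.impute import SimpleImputer

train_fea = np.array([[1.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0],
                      [1.0, np.nan, 0.0]])

imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit() only learns the column means and returns the imputer itself,
# not an array, so its result cannot be passed to a classifier as data.
returns_self = imp.fit(train_fea) is imp

# transform() is the step that produces the filled-in array.
train_fea_imputed = imp.transform(train_fea)

# A column that is entirely NaN has no mean to fill with; with the
# default settings it is discarded, so the output has fewer columns.
all_nan_col = np.array([[1.0, np.nan],
                        [3.0, np.nan]])
dropped = SimpleImputer(strategy="mean").fit_transform(all_nan_col)

print(train_fea_imputed.shape, dropped.shape)
```

If the shape changes after imputation, checking np.isnan(train_fea).all(axis=0) is a quick way to see whether any column is entirely missing.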
Upvotes: 4