Reputation: 4669
I am using scikit-learn for some data analysis, and my dataset has some missing values (represented by NA). I load the data in with genfromtxt with dtype='f8' and go about training my classifier.
The classification is fine on RandomForestClassifier and GradientBoostingClassifier objects, but using SVC from sklearn.svm causes the following error:
probas = classifiers[i].fit(train[traincv], target[traincv]).predict_proba(train[testcv])
File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 409, in predict_proba
X = self._validate_for_predict(X)
File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 534, in _validate_for_predict
X = atleast2d_or_csr(X, dtype=np.float64, order="C")
File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 84, in atleast2d_or_csr
assert_all_finite(X)
File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 20, in assert_all_finite
raise ValueError("array contains NaN or infinity")
ValueError: array contains NaN or infinity
What gives? How can I make the SVM play nicely with the missing data, keeping in mind that the missing data works fine for random forests and other classifiers?
Upvotes: 27
Views: 18549
Reputation: 3700
The most popular answer here is outdated: "Imputer" has been replaced by "SimpleImputer", which lives in sklearn.impute. The current way to solve this issue is given here. Imputing the training and testing data worked for me as follows:
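from sklearn import svm
import numpy as np
from sklearn.impute import SimpleImputer
# Replace every NaN with the mean of its column
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
# Learn the column means from the training data only
imp = imp.fit(x_train)
# Apply the same training-set means to both splits
X_train_imp = imp.transform(x_train)
X_test_imp = imp.transform(x_test)
# Train and predict on the imputed arrays, which no longer contain NaN
clf = svm.SVC()
clf = clf.fit(X_train_imp, y_train)
predictions = clf.predict(X_test_imp)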
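If the model is evaluated with cross-validation, as in the question, one option is to chain the imputer and the classifier in a Pipeline so that the column means are recomputed from each training fold only. This is a minimal sketch, not a definitive setup: the data below is synthetic and the names X, y and model are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic data for illustration only (not the asker's dataset)
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.rand(*X.shape) < 0.1] = np.nan  # mark roughly 10% of the entries as missing

# The imputer is fitted inside each fold, on that fold's training part only
model = make_pipeline(
    SimpleImputer(missing_values=np.nan, strategy="mean"),
    SVC(),
)
scores = cross_val_score(model, X, y, cv=5)
```

Calling model.fit(X, y) afterwards gives a single estimator that accepts arrays containing NaN in predict, because the imputation step runs first.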
Upvotes: 3
Reputation: 40169
You can either remove the samples with missing features or replace the missing features with their column-wise medians or means.
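Both options can be sketched in plain NumPy. The toy array and the variable names below are made up for illustration, and the median is just one of the possible fill values:

```python
import numpy as np

# Toy feature matrix and labels, made up for illustration
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan],
              [7.0, 8.0]])
y = np.array([0, 1, 0, 1])

# Option 1: drop every sample (row) that has at least one missing feature
complete = ~np.isnan(X).any(axis=1)
X_dropped, y_dropped = X[complete], y[complete]

# Option 2: replace each missing entry with the median of its column,
# ignoring NaN when computing the median
col_medians = np.nanmedian(X, axis=0)
X_filled = np.where(np.isnan(X), col_medians, X)
```

When there is a separate test set, the medians would normally be computed on the training data and reused to fill the test data.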
Upvotes: 6
Reputation: 456
You can do data imputation to handle missing values before using SVM.
EDIT: In scikit-learn, there's a really easy way to do this, illustrated on this page.
(copied from page and modified)
>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> # missing_values is the placeholder used for missing entries
>>> # strategy is 'mean', 'median' or 'most_frequent' (the mode)
>>> # axis=0 imputes along columns: each missing value is filled from the other samples' values for the same feature
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit(train)
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> train_imp = imp.transform(train)
Upvotes: 25