Reputation: 33
I am trying to build a binary SVM classifier with ECG data to diagnose sleep apnea. For roughly 16,000 inputs, I apply a wavelet transform, manually extract HRV features, store them in a feature list, and feed this list into the classifier.
This worked fine with the raw data, but after I added the wavelet transform as a preprocessing step, some values in the feature list became NaN,
which meant I got this error for this line of code:
clf.fit(X_train, y_train)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
so I executed this step:
x = pd.DataFrame(data=X_train)
x = x[~x.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
This solved the ValueError, but removing the 'faulty' inputs means the number of samples in X_train and y_train no longer matches:
clf.fit(x, y_train)
#error
Found input variables with inconsistent numbers of samples: [11255, 11627]
How can I remove the corresponding values from y_train so that the samples match up? Or is there a better approach to this?
Please let me know if you need more info on the code.
Upvotes: 1
Views: 230
Reputation: 878
Without sample data I can't test this, but you are already checking for valid rows in the X_train
dataframe, which is good. Now you just need to remove the corresponding y_train
labels using the same row selection. Something like this:
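x = pd.DataFrame(data=X_train)
# Boolean mask: True for rows that contain no NaN or infinite values
valid_indexes = ~x.isin([np.nan, np.inf, -np.inf]).any(axis=1)
x = x[valid_indexes]
# Apply the same mask by position so the labels stay aligned with the rows
y_train = np.asarray(y_train)[valid_indexes.to_numpy()]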
Make sure you always build the validity mask from the X_train
data, since I presume that all of the labels are valid.
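As a rough, self-contained illustration of the same idea, here is a NumPy-only sketch. The helper name drop_non_finite_rows and the tiny demo arrays are made up for this example and are not from the question; it assumes X_train can be converted to a 2-D float array and y_train to a 1-D array of the same length.

```python
import numpy as np


def drop_non_finite_rows(X, y):
    """Keep only the rows of X (and matching entries of y) where every feature is finite.

    Hypothetical helper for illustration; assumes X is 2-D and len(X) == len(y).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # True for rows with no NaN, +inf or -inf in any column
    mask = np.isfinite(X).all(axis=1)
    return X[mask], y[mask]


# Made-up demo data: rows 1 and 2 contain NaN / inf and should be dropped
X_demo = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.inf], [5.0, 6.0]])
y_demo = np.array([0, 1, 1, 0])

X_clean, y_clean = drop_non_finite_rows(X_demo, y_demo)
print(X_clean.shape, y_clean.shape)
```

Because the mask is computed once and applied to both arrays by position, the sample counts stay equal, so clf.fit(X_clean, y_clean) would no longer hit the inconsistent-samples error.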
Upvotes: 2