Sakshi Kumar

Reputation: 33

Python SVM Classifier - issues with input NaNs and data shape

I am trying to build a binary SVM classifier on ECG data to diagnose sleep apnea. For roughly 16,000 inputs, I apply a wavelet transform, manually extract HRV features into a feature list, and feed that list into the classifier.

This worked fine on the raw data, before I added the wavelet transform as a preprocessing step. After the transform, some values in the feature list became NaN, so this line of code raised an error:

clf.fit(X_train, y_train)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

So I added this filtering step:

x = pd.DataFrame(data=X_train)
x = x[~x.isin([np.nan, np.inf, -np.inf]).any(axis=1)]

This resolved the ValueError, but removing the 'faulty' inputs means the number of samples in x and y_train no longer matches:

clf.fit(x, y_train)

#error
Found input variables with inconsistent numbers of samples: [11255, 11627]

How can I remove the corresponding values from y_train so that the samples match up? Or is there a better approach?

Please let me know if you need more info on the code.

Upvotes: 1

Views: 230

Answers (1)

gnodab

Reputation: 878

Without sample data I can't test this, but you are already checking for valid rows in the X_train DataFrame, which is good. Now you just need to remove the corresponding labels using the same mask. Something like this:

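import numpy as np
import pandas as pd

x = pd.DataFrame(data=X_train)
# True for rows that contain no NaN or infinite values
valid_indexes = ~x.isin([np.nan, np.inf, -np.inf]).any(axis=1)
x = x[valid_indexes]

# Apply the same mask to the labels so features and labels stay aligned
y_train = np.asarray(y_train)[valid_indexes.to_numpy()]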

Always compute the validity mask from the X_train data, not from the labels. I am presuming here that all of the labels are valid.
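For reference, here is a minimal, self-contained sketch of the same idea on made-up toy data. The values and the helper name drop_invalid_rows are hypothetical, not taken from the question. It assumes the features are numeric and that there is exactly one label per feature row. It uses np.isfinite rather than isin, which flags the same NaN and infinite values.

```python
import numpy as np
import pandas as pd


def drop_invalid_rows(X, y):
    """Drop rows of X that contain NaN or +/-inf, plus the matching labels.

    Hypothetical helper for illustration; assumes X is numeric and
    len(X) == len(y).
    """
    X = pd.DataFrame(X).reset_index(drop=True)
    y = np.asarray(y)
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of samples")

    # One boolean per row: True only if every feature in the row is finite
    valid = np.isfinite(X.to_numpy(dtype=float)).all(axis=1)

    # Index both with the same mask so they stay aligned
    return X.loc[valid], y[valid]


# Toy stand-ins for the real feature list and labels
X_train = [[1.0, 2.0], [np.nan, 3.0], [4.0, np.inf], [5.0, 6.0]]
y_train = [0, 1, 0, 1]

X_clean, y_clean = drop_invalid_rows(X_train, y_train)
print(X_clean.shape, y_clean.shape)
```

With the mask applied to both, clf.fit(X_clean, y_clean) should receive matching sample counts. The mask has to be built before anything reorders only one of the two, such as shuffling the features separately from the labels.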

Upvotes: 2
