Reputation: 398
I am trying to implement a machine learning algorithm which detects irregular ECG signals. I extracted some features, but I am not sure how to build the correct input for the classifier.
I have 20k different ECG signals, and each signal has 1000 values. They are all labeled as correct or incorrect.
I chose, for example, the two features heart_rate and xposition_of_3_highest_peaks, but how do I feed them into the classifier?
Below you can see my attempt, but every time I add a second feature the score decreases. Why?
clf = svm.SVC()
#[64,70,48,89...74,58]
X_train_heartRate = StandardScaler().fit_transform(fe.get_avg_heart_rate(X_train))
X_test_heartRate = StandardScaler().fit_transform(fe.get_avg_heart_rate(X_test))
#[[23,56,89],[24,45,78],...,[21,58,90]]
X_train_3_peaks = StandardScaler().fit_transform(fe.get_intervalls(X_train))
X_test_3_peaks = StandardScaler().fit_transform(fe.get_intervalls(X_test))
X_tr = np.concatenate((X_train_heartRate, X_train_3_peaks), axis=1)
X_te = np.concatenate((X_test_heartRate, X_test_3_peaks), axis=1)
clf.fit(X_tr, Y_train)
print("Prediction:", clf.predict(X_te))
print("real Solution:", Y_test)
print(clf.score(X_te,Y_test))
I am not sure whether the StandardScaler().fit_transform is necessary or whether the np.concatenate is correct. Maybe there is even a better classifier for this use case?
Sorry I am a complete beginner, please be kind :)
Upvotes: 1
Views: 311
Reputation: 104474
When you apply any pre-processing transformation, you must fit it on the training data and then apply that same fitted transformation to the validation / test data. The transformation has to reuse the statistics of the training data, because you are assuming that the validation / test data come from the same distribution. Therefore, create an object that stores the transformation fitted on the training data, then apply it to the training and test data alike. Your performance decreased because you are not doing this: you are scaling the two datasets with separate means and standard deviations, which can lead to out-of-distribution predictions if your sample size isn't large enough.
Therefore, call fit_transform on the training data, then just transform on the validation / test data. fit_transform will simultaneously find the scaling parameters for each column, apply them to the input data, and return the transformed data to you. transform assumes an already fitted scaler, such as the one produced by fit_transform, and applies the scaling accordingly. I sometimes like to separate the operations and do a separate fit on the training data, then transform on the training and validation / test data afterwards. This is a common source of confusion for new practitioners. You also need to keep the scaler object around so you can apply it to your validation / test data later.
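As a minimal sketch of that fit / transform split, with made-up numbers standing in for your actual features:
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature column standing in for e.g. the average heart rate
X_train_feat = np.array([[64.0], [70.0], [48.0], [89.0]])
X_test_feat = np.array([[74.0], [58.0]])

scaler = StandardScaler()
scaler.fit(X_train_feat)                       # learn mean / std from the training data only
X_train_scaled = scaler.transform(X_train_feat)
X_test_scaled = scaler.transform(X_test_feat)  # reuse the training mean / std on the test data

print(scaler.mean_, scaler.scale_)             # both come from the training set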
clf = svm.SVC()

# Heart-rate feature, e.g. [64, 70, 48, 89, ..., 74, 58]
heartRate_scaler = StandardScaler()
X_train_heartRate = heartRate_scaler.fit_transform(fe.get_avg_heart_rate(X_train))
X_test_heartRate = heartRate_scaler.transform(fe.get_avg_heart_rate(X_test))

# Peak-position feature, e.g. [[23, 56, 89], [24, 45, 78], ..., [21, 58, 90]]
three_peaks_scaler = StandardScaler()
X_train_3_peaks = three_peaks_scaler.fit_transform(fe.get_intervalls(X_train))
X_test_3_peaks = three_peaks_scaler.transform(fe.get_intervalls(X_test))

X_tr = np.concatenate((X_train_heartRate, X_train_3_peaks), axis=1)
X_te = np.concatenate((X_test_heartRate, X_test_3_peaks), axis=1)

clf.fit(X_tr, Y_train)
print("Prediction:", clf.predict(X_te))
print("real Solution:", Y_test)
print(clf.score(X_te, Y_test))
Take note that you can also concatenate the features you want first and apply the StandardScaler after the fact, because the method standardizes each feature/column independently. Scaling the different sets of features and concatenating them afterwards, as above, is no different from concatenating the features first and scaling after.
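For example, a quick sketch with made-up numbers (not your real features) showing the concatenate-first variant:
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature blocks: one heart-rate column, three peak-position columns
train_heart_rate = np.array([[64.0], [70.0], [48.0]])
train_3_peaks = np.array([[23.0, 56.0, 89.0], [24.0, 45.0, 78.0], [21.0, 58.0, 90.0]])

# Concatenate first, then scale: StandardScaler standardizes each column independently,
# so this matches scaling the two blocks separately and concatenating afterwards
X_tr_raw = np.concatenate((train_heart_rate, train_3_peaks), axis=1)
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr_raw)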
I forgot to ask about the fe object. What is it doing under the hood? Does it use the training data in any way to compute your features? You must make sure that this object uses the statistics of your training data for both the training and the test data, not separate statistics for each. Just as the pre-processing must match between training and validation / test, the statistics must also match inside this fe object. I assume it either applies the training data's statistics to both sets of data, or it is an independent transformation that is agnostic to the dataset. Either way, you haven't specified what it does under the hood, so I will assume the happy path.
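Purely as an illustration (this is not your fe implementation; it assumes signals shaped (n_signals, 1000) and a sampling rate you would have to fill in), a per-signal extractor like the one below is agnostic, because it never looks at more than one signal at a time:
import numpy as np
from scipy.signal import find_peaks

FS = 250  # assumed sampling rate in Hz; replace with whatever your recordings actually use

def get_avg_heart_rate(signals):
    # Hypothetical per-signal feature: beats per minute estimated from the peak count.
    # It only uses each signal itself, so no statistics can leak between train and test.
    rates = []
    for sig in signals:
        peaks, _ = find_peaks(sig, distance=int(FS * 0.4))  # crude R-peak detection
        duration_min = len(sig) / FS / 60.0
        rates.append(len(peaks) / duration_min)
    return np.array(rates).reshape(-1, 1)
If instead your get_avg_heart_rate or get_intervalls normalizes by some dataset-wide quantity, that quantity has to be computed from the training set and reused on the test set, just like the scalers above.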
Consider using a decision tree-based algorithm like a Random Forest Classifier, which does not require scaling of the input features, since its job is to partition the feature space of your data into N-dimensional hyperrectangles, with N being the number of features in your dataset (for N=2 this is a 2D rectangle, for N=3 a 3D box, and so on). Depending on how your data is distributed, tree-based algorithms can do better and are among the first things to try in Kaggle competitions.
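A minimal sketch, reusing the X_tr, X_te, Y_train and Y_test from the snippet above (with a Random Forest you could equally well feed the unscaled features):
from sklearn.ensemble import RandomForestClassifier

# Tree-based model; n_estimators=200 is just a reasonable starting point to tune
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_tr, Y_train)
print("Random Forest accuracy:", rf.score(X_te, Y_test))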
Upvotes: 1