Joe J.

Reputation: 129

Sklearn: how to get mean squared error on classifying training data

I'm trying to do some classification problems using sklearn for the first time in Python, and was wondering about the best way to calculate the error of my classifier (like an SVM) solely on the training data.

My sample code for calculating accuracy and RMSE is as follows:

    from math import sqrt
    import numpy as np
    from sklearn import svm
    from sklearn.metrics import mean_squared_error
    # Fit an RBF-kernel SVM, then record in-sample accuracy and out-of-sample RMSE/accuracy
    svc = svm.SVC(kernel='rbf', C=C, decision_function_shape='ovr').fit(X_train, y_train.ravel())
    prediction = svc.predict(X_test)
    svm_in_accuracy.append(svc.score(X_train, y_train))
    svm_out_rmse.append(sqrt(mean_squared_error(np.asarray(y_test), prediction)))
    svm_out_accuracy.append((np.asarray(y_test) == prediction).mean())

I know that 'from sklearn.metrics import mean_squared_error' can pretty much get me the MSE for an out-of-sample comparison. What can I do in sklearn to get an error metric for how well or badly my model classifies the training data? I ask this because I know my data is not perfectly linearly separable (which means the classifier will misclassify some items), and I want to know the best way to quantify how far off it was. Any help would be appreciated!
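For reference, the in-sample versions of the metrics above are just the same calls applied to the training data; a minimal sketch, reusing the variable names from my snippet:

    # Score the fitted model on the data it was trained on
    train_prediction = svc.predict(X_train)
    train_rmse = sqrt(mean_squared_error(y_train.ravel(), train_prediction))
    train_accuracy = (y_train.ravel() == train_prediction).mean()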

Upvotes: 0

Views: 4525

Answers (1)

KRKirov

Reputation: 4004

To evaluate your classifier you can use the following metrics:

    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import classification_report
    from sklearn.metrics import roc_curve
    from sklearn.metrics import roc_auc_score

The confusion matrix has the predicted labels as column headings and the true labels as row labels. The main diagonal of the confusion matrix shows the number of correctly assigned labels; any off-diagonal elements contain the number of incorrectly assigned labels. From the confusion matrix you can also calculate accuracy, precision and recall. Both the classification report and the confusion matrix are straightforward to use - you pass the true and predicted labels to the functions:

    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))

    [[1047    5]
     [   0  448]]

                precision    recall  f1-score   support

            0.0       1.00      1.00      1.00      1052
            1.0       0.99      1.00      0.99       448

    avg / total       1.00      1.00      1.00      1500
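As a rough sketch of how accuracy, precision and recall come straight off the confusion matrix (binary case, using the same y_test and y_pred as above; variable names are illustrative):

    import numpy as np
    from sklearn.metrics import confusion_matrix

    cm = confusion_matrix(y_test, y_pred)
    tn, fp, fn, tp = cm.ravel()           # binary layout: [[tn, fp], [fn, tp]]
    accuracy = (tp + tn) / cm.sum()       # correctly assigned labels / all labels
    precision = tp / (tp + fp)            # correct positives among predicted positives
    recall = tp / (tp + fn)               # correct positives among true positives
    print(accuracy, precision, recall)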

The other two functions calculate the Receiver Operating Characteristic (ROC) curve, which you can then plot, and the Area Under the Curve (AUC) of the ROC; a minimal sketch follows the links below. You can read about ROC here:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html

http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
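A rough sketch, assuming a binary problem and using the decision_function output of the fitted SVC from the question as the score (matplotlib is only used for the plot):

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, roc_auc_score

    # Continuous scores for the positive class; for an SVC, decision_function
    # works without needing probability=True
    y_score = svc.decision_function(X_test)

    fpr, tpr, thresholds = roc_curve(y_test, y_score)
    auc = roc_auc_score(y_test, y_score)

    plt.plot(fpr, tpr, label='ROC curve (AUC = %.3f)' % auc)
    plt.plot([0, 1], [0, 1], linestyle='--')   # chance line
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.legend()
    plt.show()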

Upvotes: 1
