PawelPawel

Reputation: 77

Scikit-learn - score dataset after cross-validation

I'm learning about cross-validation using scikit-learn (http://scikit-learn.org/stable/modules/cross_validation.html).

My code:

from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()

# prepare sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)

# create model
clf1 = svm.SVC(kernel='linear', C=1)

# train model
scores = cross_val_score(clf1, x_train, y_train, cv=5)

# accuracy on train data
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# accuracy on yet-unseen data
print clf1.score(x_test, y_test)

I understand that with cross-validation we can use the whole dataset to train and validate, as in the example in the scikit-learn docs. What if I want to score data after cross-validating? I assume my model is trained after learning with cross-validation. When I call score() I get

raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.utils.validation.NotFittedError: This SVC instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.

In the docs, paragraph 3.1.1.1 mentions cross_val_predict, which I could use, but why do I need the cv argument (the number of folds) when I just want to check the accuracy of a trained model?

I would be thankful for any hint.

Upvotes: 1

Views: 4565

Answers (1)

Tonechas

Reputation: 13723

Here's some code that gets the job done, with a step-by-step explanation of how it works.

To begin with, let us import the necessary modules:

In [204]: from sklearn.model_selection import cross_val_score, StratifiedKFold

In [205]: from sklearn import datasets

In [206]: from sklearn import svm

You should make sure you have scikit-learn 0.18 installed, otherwise the following code might not work. Please note that I'm using sklearn.model_selection instead of sklearn.cross_validation because the latter is deprecated as of version 0.18.

Then we load the iris dataset and create arrays X and y with the features and labels, respectively:

In [207]: iris = datasets.load_iris()

In [208]: X, y = iris.data, iris.target

In the next step we create an instance of the C-Support Vector Classification class:

In [209]: clf = svm.SVC(kernel='linear', C=1)

Now we create a stratified K-folds validator which splits the dataset into 5 disjoint subsets, namely A, B, C, D and E. These five folds are stratified, which means that the proportion of samples of each class in A, B, C, D and E is the same as in the overall dataset.

In [210]: skf = StratifiedKFold(n_splits=5, random_state=0)
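Just to illustrate the stratification, here's a quick sanity check (this snippet is my addition; I omit random_state because it has no effect when shuffle is left at its default of False) that counts the class labels in each test fold:

```python
from collections import Counter
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold

iris = datasets.load_iris()
X, y = iris.data, iris.target

skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):
    # every test fold contains 10 samples of each of the 3 classes
    print(Counter(y[test_index]))
```

Since the iris dataset has 50 samples per class, each of the 5 test folds ends up with exactly 10 samples of every class.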

Finally, we estimate the generalization accuracy through 5 classification trials:

In [211]: scores = cross_val_score(clf, X, y, cv=skf)

In [212]: scores
Out[212]: array([ 0.9667,  1.    ,  0.9667,  0.9667,  1.    ])

In [213]: scores.mean()
Out[213]: 0.98000000000000009

5-fold cross-validation can be summarized as follows:

Classification No.   Training Samples   Test Samples   Accuracy
1                    A + B + C + D      E              0.9667
2                    A + B + C + E      D              1.
3                    A + B + D + E      C              0.9667
4                    A + C + D + E      B              0.9667
5                    B + C + D + E      A              1.

It clearly emerges from the table above that each sample is used four times for training and is tested only once.
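As for the NotFittedError: cross_val_score clones the estimator for each fold, so clf itself is left unfitted. If you also want to score held-out data with a trained model, you have to fit it explicitly first. A minimal sketch reusing the train/test split from your question:

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

clf = svm.SVC(kernel='linear', C=1)
clf.fit(x_train, y_train)         # explicit fit; cross_val_score never fits clf in place
print(clf.score(x_test, y_test))  # accuracy on the held-out 20%
```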

Replies to your additional comments:

  1. The main advantage of cross-validation is that all the samples are used for both training and testing, i.e. cross-validation provides you with the maximum modelling and testing capability, which is particularly important when the dataset is small.
  2. One way of avoiding overfitting is by using different samples for training and testing.
  3. A common approach for choosing the model parameters consists in validating the model for different parameter sets and selecting those values that maximize the classification accuracy.
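To illustrate point 3, that parameter search can be automated with GridSearchCV, which cross-validates every combination in a grid and keeps the best one (this snippet is my addition; the grid values are arbitrary examples):

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# arbitrary example grid; extend with the parameters you care about
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(iris.data, iris.target)

print(search.best_params_)  # parameter set with the highest CV accuracy
print(search.best_score_)   # mean cross-validated accuracy of that set
```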

Upvotes: 3
