Reputation: 77
I'm learning about cross validation using scikit-learn (http://scikit-learn.org/stable/modules/cross_validation.html)
My code:
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score
from sklearn import datasets
from sklearn import svm
iris = datasets.load_iris()
# prepare sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)
# create model
clf1 = svm.SVC(kernel='linear', C=1)
# train model
scores = cross_val_score(clf1, x_train, y_train, cv=5)
# accuracy on train data
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
# accuracy on yet-unseen data
print(clf1.score(x_test, y_test))
I understand that with cross-validation we can use the whole dataset to train and validate, as in the example in the scikit-learn docs. But what if I want to score new data after cross-validating? I assumed my model was trained after cross-validation, but when I use score()
I get
raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.utils.validation.NotFittedError: This SVC instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
The docs mention cross_val_predict
in paragraph 3.1.1.1, and I could use it, but why do I need the cv argument (the number of folds) when I just want to check the accuracy of a trained model?
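For context, cross_val_score fits clones of the estimator internally, so the clf1 object passed in is never itself fitted. A minimal sketch of the fix, mirroring the question's setup, is to fit it explicitly before calling score():

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import datasets, svm

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

clf1 = svm.SVC(kernel='linear', C=1)

# cross_val_score fits clones internally; clf1 itself stays unfitted
scores = cross_val_score(clf1, x_train, y_train, cv=5)

clf1.fit(x_train, y_train)          # explicit fit removes the NotFittedError
print(clf1.score(x_test, y_test))   # accuracy on the held-out test set
```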
I would be thankful for any hint.
Upvotes: 1
Views: 4565
Reputation: 13723
Here's some code that gets the job done, along with a step-by-step explanation of how it works.
To begin with, let us import the necessary modules:
In [204]: from sklearn.model_selection import cross_val_score, StratifiedKFold
In [205]: from sklearn import datasets
In [206]: from sklearn import svm
Make sure you have scikit-learn 0.18 or later installed, otherwise the following code might not work. Notice that I'm importing from sklearn.model_selection
rather than sklearn.cross_validation
because the latter is deprecated as of version 0.18.
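A quick way to confirm your installation supports the imports above:

```python
# Sanity check: sklearn.model_selection exists only in scikit-learn >= 0.18
import sklearn
import sklearn.model_selection

print(sklearn.__version__)  # should print 0.18 or later
```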
Then we load the iris dataset and create arrays X
and y
with the features and labels, respectively
In [207]: iris = datasets.load_iris()
In [208]: X, y = iris.data, iris.target
In the next step we create an instance of the C-Support Vector Classification class:
In [209]: clf = svm.SVC(kernel='linear', C=1)
Now we create a stratified K-Folds validator which splits the dataset into 5 disjoint subsets, namely A, B, C, D and E. These five folds are stratified, which means that the proportion of samples of each class in A, B, C, D and E are the same as in the overall dataset.
In [210]: skf = StratifiedKFold(n_splits=5)
(Note that random_state only has an effect when shuffle=True is also passed; with the default shuffle=False, the splits are deterministic anyway.)
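We can verify the stratification claim directly: the iris dataset has 50 samples of each of its three classes, so each of the 5 test folds should contain exactly 10 samples per class. A small check:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold

iris = datasets.load_iris()
X, y = iris.data, iris.target

skf = StratifiedKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # count how many samples of each class land in this fold's test set
    counts = np.bincount(y[test_idx])
    print("fold %d test-set class counts: %s" % (fold, counts))
# each fold's test set holds 10 samples of each of the three classes
```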
Finally, we estimate the generalization accuracy through 5 classification trials:
In [211]: scores = cross_val_score(clf, X, y, cv=skf)
In [212]: scores
Out[212]: array([ 0.9667, 1. , 0.9667, 0.9667, 1. ])
In [213]: scores.mean()
Out[213]: 0.98000000000000009
5-fold cross-validation can be summarized as follows:

Classification No.    Training Samples    Test Samples    Accuracy
        1             A + B + C + D            E           0.9667
        2             A + B + C + E            D           1.
        3             A + B + D + E            C           0.9667
        4             A + C + D + E            B           0.9667
        5             B + C + D + E            A           1.
It clearly emerges from the table above that each sample is used four times for training and is tested only once.
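The bookkeeping in the table can also be checked in code: with 5 folds, every sample appears in exactly four training sets and exactly one test set.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold

iris = datasets.load_iris()
X, y = iris.data, iris.target

# tally how many times each sample index lands in a train/test split
train_counts = np.zeros(len(y), dtype=int)
test_counts = np.zeros(len(y), dtype=int)
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    train_counts[train_idx] += 1
    test_counts[test_idx] += 1

print(train_counts.min(), train_counts.max())  # 4 4
print(test_counts.min(), test_counts.max())    # 1 1
```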
Upvotes: 3