Different accuracy for cross_val_score and train_test_split

I am testing RandomForestClassifier on a simple dataset from sklearn. When I split the data with train_test_split, I get accuracy=0.89. If I use cross-validation with cross_val_score with the same classifier parameters, the accuracy is lower, about 0.83. Why?

Here is the code:

import numpy as np

from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_circles

np.random.seed(42)


# create the dataset:
x, y = make_circles(n_samples=500, factor=0.1, noise=0.35, random_state=42)

# initialize the stratified split:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# create the classifier:
clf = RandomForestClassifier(random_state=42, max_depth=12, n_jobs=-1,
                             oob_score=True, n_estimators=100,
                             min_samples_leaf=10)


# average accuracy over the cross-validation folds:
results = np.mean(cross_val_score(clf, x, y, cv=skf, scoring=make_scorer(accuracy_score)))
print("ACCURACY WITH CV = ", results)  # prints 0.832

# use train_test_split:
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)

clf = RandomForestClassifier(random_state=42, max_depth=12, n_jobs=-1,
                             oob_score=True, n_estimators=100,
                             min_samples_leaf=10)
clf.fit(xtrain, ytrain)
ypred = clf.predict(xtest)
print("ACCURACY WITHOUT CV = ", accuracy_score(ytest, ypred))  # prints 0.89

What I got: ACCURACY WITH CV = 0.83, ACCURACY WITHOUT CV = 0.89.

Upvotes: 0

Views: 598

Answers (2)

Tickloop

Reputation: 76

Cross-validation runs multiple experiments on different splits of the data and then averages their results. This ensures that the outcome is not biased by any single split, as it is in your case.

Your chosen seed, along with some luck, gave you a train/test split with a higher accuracy than the average. That higher accuracy is an artifact of random sampling when making the split, not an indicator of better model performance.

Simply put:

  • Cross-validation makes multiple splits of the data. Your model is trained on each of these splits and the performance is averaged.

  • If you pick just one of these splits, you may get lucky: the test set may happen to resemble the training data closely, and your model will show a high accuracy.

  • Or you may get unlucky: the test set may differ noticeably from the training data, and your model will show a lower accuracy.

Thus, cross-validation averages the results of several such splits (5 in your case).

Here is your code run in a google colab notebook:

https://colab.research.google.com/drive/16-NotF-_WVLESmvGMONSGSZigxrT3KLx?usp=sharing

The last cell makes 5 different splits and then averages their accuracies. Notice that this average matches the score you got from cross-validation, and that some splits score higher while others score lower; a sketch of that experiment follows.
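Here is a minimal sketch of that experiment, reusing the question's data (the per-split seeds 0–4 are illustrative, not the ones used in the notebook):

import numpy as np
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

x, y = make_circles(n_samples=500, factor=0.1, noise=0.35, random_state=42)

# make 5 different train/test splits, score each one, then average
scores = []
for seed in range(5):
    xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=seed)
    clf = RandomForestClassifier(random_state=42, max_depth=12, n_jobs=-1,
                                 oob_score=True, n_estimators=100,
                                 min_samples_leaf=10)
    clf.fit(xtrain, ytrain)
    scores.append(accuracy_score(ytest, clf.predict(xtest)))

print("per-split accuracies:", np.round(scores, 3))  # some higher, some lower
print("average:", np.mean(scores))  # close to the cross-validation score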

To further convince yourself, look at the output of:

cross_val_score(clf, x, y, cv=skf, scoring=make_scorer(accuracy_score))

The output is an array of scores (accuracies in your case), one for each of the 5 splits. You can see that they take varying values around 0.83.
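For example, reusing clf, x, y, and skf from the question:

scores = cross_val_score(clf, x, y, cv=skf, scoring=make_scorer(accuracy_score))
print(scores)                       # one accuracy per fold, scattered around 0.83
print(scores.mean(), scores.std())  # the average and a sense of the spread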

Upvotes: 2

Baradrist

Reputation: 192

This is just down to chance in the split and in the random state of the Random Forest Classifier. Try leaving random_state=42 out and fitting several times, and you'll get a range of different accuracies; by chance, I had one run without CV of "just" 0.78! In contrast, CV gives you an average (your calculated mean) PLUS an idea of how much your accuracy can vary around it. A sketch of the repeated fit is below.
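Here is a minimal sketch of that repeated fit, reusing x and y from the question (the number of repeats is arbitrary):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# with no fixed seeds, both the split and the forest are random,
# so the accuracy varies from run to run
for _ in range(5):
    xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)
    clf = RandomForestClassifier(max_depth=12, n_jobs=-1, oob_score=True,
                                 n_estimators=100, min_samples_leaf=10)
    clf.fit(xtrain, ytrain)
    print(accuracy_score(ytest, clf.predict(xtest)))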

Upvotes: 1
