Akhmad Zaki

Reputation: 433

Cross Validation Python Sklearn

I want to do cross-validation on my SVM classifier before using it on the actual test set. What I want to ask is: do I run the cross-validation on the original dataset, or on the training set returned by train_test_split()?

import pandas as pd
from sklearn.model_selection import KFold,train_test_split,cross_val_score
from sklearn.svm import SVC

df = pd.read_csv('dataset.csv', header=None)
X = df.iloc[:, 0:10].values
y = df.iloc[:, 10].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=40)

kfold = KFold(n_splits=10, shuffle=True, random_state=40)

svm = SVC(kernel='poly')
results = cross_val_score(svm, X, y, cv=kfold)  # cross-validation on the original set

or

import pandas as pd
from sklearn.model_selection import KFold,train_test_split,cross_val_score
from sklearn.svm import SVC

df = pd.read_csv('dataset.csv', header=None)
X = df.iloc[:, 0:10].values
y = df.iloc[:, 10].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=40)

kfold = KFold(n_splits=10, shuffle=True, random_state=40)

svm = SVC(kernel='poly')
results = cross_val_score(svm, X_train, y_train, cv=kfold)  # cross-validation on the training set

Upvotes: 0

Views: 2841

Answers (1)

JahKnows

Reputation: 2706

It is best to reserve a test set that is used only once, after you are satisfied with your model and right before deploying it. So do your train/test split, then set the test set aside; we will not touch it.

Perform the cross-validation only on the training set. For each of the k folds you will use part of the training set to train and the rest as a validation set. Once you are satisfied with your model and your choice of hyper-parameters, use the test set to get your final benchmark.
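
To make the fold mechanics concrete, here is a minimal sketch of what cross_val_score does internally, continuing from the X_train and y_train produced by the train_test_split() call in your question (the explicit loop and the fold_scores name are just for illustration):

from sklearn.model_selection import KFold
from sklearn.svm import SVC

kfold = KFold(n_splits=10, shuffle=True, random_state=40)
svm = SVC(kernel='poly')

fold_scores = []
# Each iteration trains on 9 of the 10 folds of the training set and
# validates on the held-out fold; X_test and y_test are never touched.
for train_idx, val_idx in kfold.split(X_train):
    svm.fit(X_train[train_idx], y_train[train_idx])
    fold_scores.append(svm.score(X_train[val_idx], y_train[val_idx]))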

The second block of code in your question is the correct one.
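
Putting it together, a sketch of the full workflow (assuming the same dataset.csv layout as in your question) would look like this:

import pandas as pd
from sklearn.model_selection import KFold, train_test_split, cross_val_score
from sklearn.svm import SVC

df = pd.read_csv('dataset.csv', header=None)
X = df.iloc[:, 0:10].values
y = df.iloc[:, 10].values

# 1. Split once and set the test set aside.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=40)

# 2. Cross-validate (and tune hyper-parameters) on the training set only.
kfold = KFold(n_splits=10, shuffle=True, random_state=40)
svm = SVC(kernel='poly')
cv_scores = cross_val_score(svm, X_train, y_train, cv=kfold)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# 3. When tuning is done, refit on the whole training set and report
#    the final benchmark on the untouched test set.
svm.fit(X_train, y_train)
print("Test accuracy: %.3f" % svm.score(X_test, y_test))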

Upvotes: 4
