Ulu83

Reputation: 563

Should I first train_test_split and then use cross validation?

If I plan to use cross-validation (KFold), should I still split the dataset into training and test data and perform my training (including cross-validation) only on the training set? Or will CV do everything for me? For example:

Option 1

X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = GridSearchCV(..., cv=5)
clf.fit(X_train, y_train)

Option 2

clf = GridSearchCV(..., cv=5)
clf.fit(X, y)

Upvotes: 5

Views: 3843

Answers (1)

CrazyElf

Reputation: 765

CV is good, but it's better to also have a train/test split so that you get an independent score estimate on untouched data. If your CV score and your test score are about the same, you can drop the train/test split phase and run CV on the whole dataset to achieve a slightly better model score. But don't do that until you are sure your test score and CV score are consistent.
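As a rough sketch of that consistency check, here is Option 1 with the CV score and the held-out test score printed side by side. The synthetic dataset, the LogisticRegression estimator and the grid over C are all assumptions made for illustration; they are not from the question, so substitute your own data, estimator and parameters.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data stands in for the asker's X and y (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out a test set that the grid search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Illustrative estimator and parameter grid (assumption); swap in your own.
clf = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
clf.fit(X_train, y_train)

# Mean cross-validated accuracy of the best parameter setting.
cv_score = clf.best_score_
# Accuracy of the refitted best model on the untouched test set.
test_score = clf.score(X_test, y_test)

print(f"CV score:   {cv_score:.3f}")
print(f"Test score: {test_score:.3f}")
```

If the two numbers stay close across a few different random_state values, that is a reasonable sign the CV estimate can be trusted. A test score well below the CV score suggests the search has overfit to the training folds.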

Upvotes: 1
