Reputation: 13
I was assigned a task that requires creating a Decision Tree Classifier and determining the accuracy rates using the training set and 10-fold cross-validation. I went over the documentation for cross_val_predict, as I believe that this is the function I am going to need.
What I am having trouble with is splitting the data set. As far as I am aware, in the usual case the train_test_split() function is used to split the data set in two: the training set and the test set. From my understanding, for K-fold cross-validation you need to further split the training set into K parts.
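For reference, here is roughly the setup I have in mind so far (a minimal sketch; load_iris and DecisionTreeClassifier are just placeholders for my actual data and model):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # placeholder data, just to have something runnable

# the "usual" split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 10-fold cross-validated predictions on the training set
clf = DecisionTreeClassifier(random_state=0)
y_cv_pred = cross_val_predict(clf, X_train, y_train, cv=10)
print(accuracy_score(y_train, y_cv_pred))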
My question is: do I need to split the data set at the beginning into train and test, or not?
Upvotes: 0
Views: 5789
Reputation: 1451
It depends. My personal opinion is yes, you have to split your dataset into a training and a test set, and then you can do cross-validation on your training set with K folds. Why? Because it is important to test your model on unseen examples after training and fine-tuning it.
But some people just do cross-validation. Here is the workflow I often use (in the sketch below, the DecisionTreeClassifier and the max_depth values are only placeholder examples):
# Imports (scikit-learn); a DecisionTreeClassifier is used below as an example model
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Data partition: hold out a test set for the final evaluation
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=21)

# Cross-validation on one or more models to see which gives the best results
print('Start cross val')
model = DecisionTreeClassifier(random_state=21)  # example model
cv_score = cross_val_score(model, X_train, Y_train, scoring='accuracy', cv=5)
# Then visualise the score you just obtained using mean, std or a plot
print('Mean CV-score : ' + str(cv_score.mean()))

# Then tune the hyper-parameters of the best (or top-n best) model using another cross-val
for param in [2, 5, 10]:  # example values for max_depth
    model = DecisionTreeClassifier(max_depth=param, random_state=21)
    cv_score = cross_val_score(model, X_train, Y_train, scoring='accuracy', cv=5)
    print('Mean CV-score with param ' + str(param) + ': ' + str(cv_score.mean()))

# Now that you have the best parameters, train the final model on the full training set
model = DecisionTreeClassifier(max_depth=5, random_state=21)  # plug in the best parameters found above
model.fit(X_train, Y_train)

# And finally test your tuned model on the held-out test set
Y_pred = model.predict(X_test)
print('Test accuracy : ' + str(accuracy_score(Y_test, Y_pred)))
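This way the test set is only used once, at the very end, so the final score is measured on data that never influenced model selection or hyper-parameter tuning.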
Upvotes: 4
Reputation: 33197
Short answer: NO
Long answer: if you want to use K-fold cross-validation, then you do not usually split the data into train/test initially.
There are a lot of ways to evaluate a model. The simplest one is to use a train/test split, fit the model on the train set and evaluate it on the test set.
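For instance, a minimal sketch of that simple approach (a DecisionTreeClassifier on the iris data, both purely placeholder assumptions):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)         # fit on the train set
print(model.score(X_test, y_test))  # evaluate on the test set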
If you adopt a cross-validation method, then you directly do the fitting/evaluation during each fold/iteration.
It's up to you what to choose, but I would go with K-Fold or LOOCV (leave-one-out cross-validation).
The K-Fold procedure (for K=5) works as follows: the data are split into 5 folds, and in each of the 5 iterations a different fold is held out for evaluation while the model is fitted on the remaining 4 folds.
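As a rough sketch (again with a DecisionTreeClassifier and the iris data as placeholder assumptions), this is how you evaluate directly with 10-fold CV or LOOCV, without any initial train/test split:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # placeholder data
model = DecisionTreeClassifier(random_state=0)

# 10-fold CV: fitting and evaluation happen inside each fold, no prior split needed
scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print(scores.mean())

# LOOCV: one sample is held out per iteration
loo_scores = cross_val_score(model, X, y, scoring='accuracy', cv=LeaveOneOut())
print(loo_scores.mean())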
Upvotes: 1