Noor

Reputation: 365

R: k-fold cross-validation for train data set

I am doing a classification task on a heart disease dataset using C5.0 in R. In the most common setup the data is divided into 80% for training and 20% for testing. I want to use k-fold cross-validation (k=10), but I am confused about one point: as I understand it, 10-fold cross-validation divides the whole data set into 10 subsets, using 9 for training and one for testing.

Is it possible to divide the data into 80% for training and 20% for testing and then apply k-fold cross-validation to the training data? Or do I have to apply k-fold cross-validation to the whole data set?

Upvotes: 3

Views: 817

Answers (2)

prashanth

Reputation: 4495

Applying k-fold cross-validation to the whole data set is the better option. In this approach the data is divided into k folds; k-1 folds are used for training and the remaining fold is used for testing, and this is repeated so that each fold serves as the test fold once. That way, once cross-validation is finished, you have a performance estimate that covers the complete data set.

One point to take care of, though: for most classification problems, parameter tuning is an important step. For this you might set aside part of the data, possibly 50%, to find the optimal parameters of the classifier, and use cross-validation there as well.
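A minimal base R sketch of that idea, applied to the 80/20 proportions from the question rather than the 50% mentioned above. It only does the index bookkeeping: `iris` is a stand-in for the heart disease data, the seed and variable names are arbitrary, and no C5.0 model is actually fitted (the comment marks where that would happen).

```r
# Illustrative only: `iris` stands in for the heart disease data.
set.seed(42)
data <- iris
n <- nrow(data)

# Hold out 20% as a final test set; the remaining 80% is used for tuning.
test_idx  <- sample(n, size = round(0.2 * n))
train_idx <- setdiff(seq_len(n), test_idx)

# Assign each training row to one of k = 10 folds.
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = length(train_idx)))

# For each fold: fit on the other k - 1 folds, validate on this one.
fold_sizes <- integer(k)
for (i in seq_len(k)) {
  val_rows <- train_idx[fold_id == i]
  fit_rows <- train_idx[fold_id != i]
  # A classifier (e.g. C5.0) would be fitted on data[fit_rows, ]
  # and scored on data[val_rows, ] at this point.
  fold_sizes[i] <- length(val_rows)
}
```

The held-out rows in `test_idx` never enter the loop, so they stay available for a final check after the parameters have been chosen.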

Upvotes: 0

Dan

Reputation: 513

One option would be k=5. In that case each fold trains on 80% of the data and tests on 20%. If all you want is a single 80/20 split, though, you don't need k-fold cross-validation.

k-fold cross-validation is always run on the whole data set. So with k=5 there are 5 different train/test splits that are evaluated and compared.
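To make the k=5 case concrete, here is a rough base R sketch of how the 5 splits could be built. `iris` is again only a stand-in data set, and the seed and names are arbitrary choices for illustration.

```r
set.seed(1)
n <- nrow(iris)   # iris is only a stand-in data set
k <- 5

# Randomly assign every row to one of the 5 folds.
fold_id <- sample(rep(seq_len(k), length.out = n))

# Build the 5 train/test splits: fold i is the test set, the rest is training.
splits <- lapply(seq_len(k), function(i) {
  list(train = which(fold_id != i), test = which(fold_id == i))
})

# Fraction of all rows that lands in each test fold.
test_frac <- sapply(splits, function(s) length(s$test) / n)

# How many times each row is used for testing across the 5 splits.
times_tested <- tabulate(unlist(lapply(splits, `[[`, "test")), nbins = n)
```

Each split tests on 20% of the rows and trains on the other 80%, and every row ends up in a test fold exactly once.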

Upvotes: 1
