Reputation: 365
I am doing some classification tasks on a heart disease dataset using C5.0 in R. In the most common setup the data is divided into 80% for training and 20% for testing. I want to use k-fold cross-validation (k=10), but I am confused about one point: with 10-fold cross-validation the data is split into 10 subsets, and in each round 9 subsets are used for training and 1 subset for testing.
Is it possible to divide the data into 80% for training and 20% for testing and then apply k-fold cross-validation on the training data, or do I have to apply k-fold cross-validation on the whole data set?
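For concreteness, this is roughly what I have in mind, as a sketch with the caret package (the data frame heart and its factor column target are placeholders for my actual data):

```r
library(caret)
library(C50)

set.seed(42)

# 80/20 split; `heart` and its factor outcome `target` are placeholders
idx        <- createDataPartition(heart$target, p = 0.8, list = FALSE)
train_data <- heart[idx, ]
test_data  <- heart[-idx, ]

# 10-fold cross-validation on the 80% training partition only
ctrl <- trainControl(method = "cv", number = 10)
fit  <- train(target ~ ., data = train_data,
              method = "C5.0", trControl = ctrl)

# the 20% hold-out is used only once, for the final evaluation
pred <- predict(fit, newdata = test_data)
confusionMatrix(pred, test_data$target)
```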
Upvotes: 3
Views: 817
Reputation: 4495
Applying k-fold cross-validation on the whole data set is the better option. In this approach the data is divided into k folds; k-1 folds are used for training and the remaining fold for testing, rotating until every fold has served as the test set. That way every observation is tested exactly once, and you get a performance estimate over the complete data once the cross-validation is finished.
One point to take care of: for most classification problems, parameter tuning is an important step. For this you could set aside, say, 50% of the data to find the optimal parameters of the classifier, using a cross-validation approach there as well.
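A minimal sketch of cross-validating on the whole data set with caret (assuming a data frame heart with a factor outcome target, which may not match the asker's exact column names):

```r
library(caret)

set.seed(42)

# 10-fold cross-validation over the complete data set;
# caret reports accuracy/kappa averaged over the 10 held-out folds
ctrl <- trainControl(method = "cv", number = 10)
fit  <- train(target ~ ., data = heart,
              method = "C5.0", trControl = ctrl)
print(fit)
```

Note that train() with method = "C5.0" already tunes the C5.0 parameters (trials, model, winnow) over a small grid inside the resampling, so the tuning can happen within the same cross-validation loop.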
Upvotes: 0
Reputation: 513
One option would be k=5. In that case each fold trains on 80% of the data and tests on the remaining 20%. But if all you want is a single 80/20 split, you don't need k-fold cross-validation for that.
k-fold cross-validation is always done on the whole data set, so with k=5 there are 5 different train/test splits that are evaluated and compared.
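A sketch of building those 5 folds by hand with caret's createFolds (again assuming a data frame heart with a factor outcome target):

```r
library(caret)
library(C50)

set.seed(42)

# split the whole data set into 5 folds; each fold is the 20% test set once
folds <- createFolds(heart$target, k = 5)

accuracies <- sapply(folds, function(test_idx) {
  train_data <- heart[-test_idx, ]
  test_data  <- heart[test_idx, ]
  model <- C5.0(target ~ ., data = train_data)
  mean(predict(model, test_data) == test_data$target)
})

accuracies       # accuracy of each of the 5 scenarios
mean(accuracies) # cross-validated accuracy averaged over the folds
```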
Upvotes: 1