Titus Pullo

Reputation: 3821

Cross Validation in Weka

I've always thought, from what I've read, that cross-validation is performed like this:

In k-fold cross-validation, the original sample is randomly partitioned into k subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds then can be averaged (or otherwise combined) to produce a single estimation

So k models are built, and the final result is the average of those. The Weka guide, however, says that the model is always built using ALL of the data set. So how does cross-validation in Weka work? Is the model built from all the data, and does "cross-validation" mean that k folds are created, the model is evaluated on each of them, and the final output is simply the averaged result across the folds?
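
In pseudocode, this is how I picture the quoted procedure (hypothetical Data/Model types, just to illustrate my reading of the quote):

    // k disjoint folds, k temporary models, performance averaged
    double sum = 0;
    for (int i = 0; i < k; i++) {
        Data train = folds.allExcept(i);     // the k - 1 training folds
        Data test  = folds.get(i);           // the held-out validation fold
        Model m    = algorithm.build(train); // model no. i
        sum += m.evaluate(test);             // performance on the held-out fold
    }
    double estimate = sum / k;               // the single averaged estimate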

Upvotes: 29

Views: 67946

Answers (6)

Nov Joy

Reputation: 21

According to "Data Mining with Weka" at The University of Waikato:

Cross-validation is a systematic way of doing repeated holdout that improves upon it by reducing the variance of the estimate.

  • We take a training set and we create a classifier
  • Then we’re looking to evaluate the performance of that classifier, and there’s a certain amount of variance in that evaluation, because it’s all statistical underneath.
  • We want to keep the variance in the estimate as low as possible.
    Cross-validation is a way of reducing the variance, and a variant of cross-validation called “stratified cross-validation” reduces it even further. (This is in contrast to the “repeated holdout” method, in which we hold out 10% for testing and repeat that 10 times.)

So how does cross-validation in Weka work?
With cross-validation, we divide our dataset just once, but we divide it into k pieces, for example, 10 pieces.
Then we take 9 of the pieces and use them for training, and the last piece we use for testing. Then, with the same division, we take a different 9 pieces for training and the held-out piece for testing. We do the whole thing 10 times, using a different segment for testing each time. In other words, we divide the dataset into 10 pieces, hold out each of these pieces in turn for testing, train on the rest, do the testing, and average the 10 results.

That is 10-fold cross-validation: divide the dataset into 10 parts (these are called “folds”), hold out each part in turn, and average the results, so that each data point in the dataset is used once for testing and 9 times for training.
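
A minimal sketch of that fold rotation with Weka's Java API (the file name is a placeholder, and J48 just stands in for whichever classifier you use; trainCV/testCV and stratify are the relevant Instances methods):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ManualTenFoldCV {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("mydata.arff"); // placeholder path
            data.setClassIndex(data.numAttributes() - 1);

            int folds = 10;
            Instances randomized = new Instances(data);
            randomized.randomize(new Random(1));
            randomized.stratify(folds); // "stratified cross-validation"

            Evaluation eval = new Evaluation(randomized);
            for (int i = 0; i < folds; i++) {
                Instances train = randomized.trainCV(folds, i); // 9 of the 10 pieces
                Instances test  = randomized.testCV(folds, i);  // the held-out piece
                J48 tree = new J48();             // a fresh model for every fold
                tree.buildClassifier(train);
                eval.evaluateModel(tree, test);   // statistics accumulate across folds
            }
            System.out.println(eval.toSummaryString()); // the averaged result
        }
    }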

Upvotes: 0

Hein Blöd

Reputation: 1653

I would have answered in a comment but my reputation still doesn't allow me to:

In addition to Rushdi's accepted answer, I want to emphasize that the models which are created for the cross-validation fold sets are all discarded after the performance measurements have been carried out and averaged.

The resulting model is always based on the full training set, regardless of your test options. Since M-T-A was asking for an update to the quoted link, here it is: https://web.archive.org/web/20170519110106/http://list.waikato.ac.nz/pipermail/wekalist/2009-December/046633.html/. It's an answer from one of the WEKA maintainers, pointing out just what I wrote.
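
In Weka's Java API the two steps are explicit (a minimal sketch; the data file and J48 are placeholders):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CVThenFinalModel {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("mydata.arff"); // placeholder path
            data.setClassIndex(data.numAttributes() - 1);

            // 1) cross-validation builds 10 internal models, measures them,
            //    averages the statistics, and then discards the models
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());

            // 2) the model you keep is built on the FULL training set
            J48 finalModel = new J48();
            finalModel.buildClassifier(data);
        }
    }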

Upvotes: 11

Mahaveer Jain

Reputation: 106

Once we've done the 10-fold cross-validation, dividing the data into 10 segments and building and evaluating a decision tree for each, what Weka does is run the algorithm an eleventh time on the whole dataset. That then produces the classifier that we might deploy in practice. We use 10-fold cross-validation to get an evaluation result and an estimate of the error, and then finally we build the classifier one more time to get an actual classifier to use in practice. During each of the k cross-validation runs we get a different decision tree, but the final one is created from the whole dataset. CV is used to see whether we have an overfitting or high-variance problem.

Upvotes: 1

cwins

Reputation: 51

I think I figured it out. Take (for example) weka.classifiers.rules.OneR -x 10 -d outmodel.xxx. This does two things:

  1. It creates a model based on the full dataset. This is the model that is written to outmodel.xxx. This model is not used as part of cross-validation.
  2. Then cross-validation is run. Cross-validation involves creating (in this case) 10 new models, with training and testing on segments of the data as described. The key is that the models used in cross-validation are temporary and only used to generate statistics. They are not equivalent to, or used for, the model that is given to the user (see the sketch below).
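
A rough Java equivalent of what that command line does (my reconstruction for illustration, not the actual Weka source; the .arff path is a placeholder):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.OneR;
    import weka.core.Instances;
    import weka.core.SerializationHelper;
    import weka.core.converters.ConverterUtils.DataSource;

    public class OneRCommandLineSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("mydata.arff"); // placeholder path
            data.setClassIndex(data.numAttributes() - 1);

            // step 1: the model built on the full dataset, saved to disk (-d outmodel.xxx)
            OneR model = new OneR();
            model.buildClassifier(data);
            SerializationHelper.write("outmodel.xxx", model);

            // step 2: 10-fold CV (-x 10) with 10 temporary models,
            // used only to produce the printed statistics
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new OneR(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }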

Upvotes: 5

Rushdi Shams

Reputation: 2423

So, here is the scenario again: you have 100 labeled data points.

Use training set

  • Weka will take the 100 labeled data points
  • it will apply an algorithm to build a classifier from these 100 data points
  • it applies that classifier AGAIN to these same 100 data points
  • it provides you with the performance of the classifier (applied to the same 100 data points from which it was developed)

Use 10 fold CV

  • Weka takes the 100 labeled data points

  • it produces 10 equal-sized folds; in each cross-validation run, 90 labeled data points are used for training and the remaining 10 for testing

  • it produces a classifier with an algorithm from the 90 training points and applies it to the 10 testing points for set 1

  • it does the same thing for sets 2 to 10, producing 9 more classifiers

  • it averages the performance of the 10 classifiers produced from the 10 equal-sized (90 training and 10 testing) splits (see the sketch below)
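
A sketch of both test options in Weka's Java API (assuming your 100 labeled instances are in a file; J48 stands in for whatever algorithm you use):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainVsTenFoldCV {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("mydata.arff"); // your 100 labeled instances
            data.setClassIndex(data.numAttributes() - 1);

            // "Use training set": build on all 100, test on the SAME 100
            J48 tree = new J48();
            tree.buildClassifier(data);
            Evaluation trainEval = new Evaluation(data);
            trainEval.evaluateModel(tree, data);             // optimistic estimate
            System.out.println(trainEval.pctCorrect());

            // "10-fold CV": ten 90/10 splits, ten temporary classifiers,
            // performance averaged over the ten test folds
            Evaluation cvEval = new Evaluation(data);
            cvEval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(cvEval.pctCorrect());
        }
    }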

Let me know if that answers your question.

Upvotes: 52

Rushdi Shams

Reputation: 2423

Weka follows the conventional k-fold cross-validation you mentioned here. You have the full data set, then divide it into k equal-sized sets (k1, k2, ..., k10, for example, for 10-fold CV) without overlaps. At the first run, take k1 to k9 as the training set and develop a model. Use that model on k10 to get the performance. Next, take k1 to k8 plus k10 as the training set, develop a model from them, and apply it to k9 to get the performance. In this way, all the folds are used, with each fold serving exactly once as the test set.

Then Weka averages the performances and presents them on the output pane.

Upvotes: 1
