Debutant

Reputation: 357

How to perform repeated k-fold cross validation in R with DAAG package?

I have fitted a linear regression model with 3-fold cross validation using the houseprices data set from the DAAG package. I have read some of the threads here and on Cross Validated, and it was mentioned multiple times that the cross validation must be repeated many times (like 50 or 100) for robustness. I'm not sure what that means. Does it mean to simply run the code 50 times and calculate the average of the overall ms?

> cv.lm(data = DAAG::houseprices, form.lm = formula(sale.price ~ area+bedrooms),
+       m = 3, dots = FALSE, seed = 29, plotit = c("Observed","Residual"),
+       main="Small symbols show cross-validation predicted values",
+       legend.pos="topleft", printit = TRUE)
Analysis of Variance Table

Response: sale.price
          Df Sum Sq Mean Sq F value Pr(>F)   
area       1  18566   18566    17.0 0.0014 **
bedrooms   1  17065   17065    15.6 0.0019 **
Residuals 12  13114    1093                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



fold 1 
Observations in test set: 5 
             11  20    21    22  23
Predicted   206 249 259.8 293.3 378
cvpred      204 188 199.3 234.7 262
sale.price  215 255 260.0 293.0 375
CV residual  11  67  60.7  58.3 113

Sum of squares = 24351    Mean square = 4870    n = 5 

fold 2 
Observations in test set: 5 
               10    13    14    17    18
Predicted   220.5 193.6 228.8 236.6 218.0
cvpred      226.1 204.9 232.6 238.8 224.1
sale.price  215.0 112.7 185.0 276.0 260.0
CV residual -11.1 -92.2 -47.6  37.2  35.9

Sum of squares = 13563    Mean square = 2713    n = 5 

fold 3 
Observations in test set: 5 
                9    12    15    16  19
Predicted   190.5 286.3 208.6 193.3 204
cvpred      174.8 312.5 200.8 178.9 194
sale.price  192.0 274.0 212.0 220.0 222
CV residual  17.2 -38.5  11.2  41.1  27

Sum of squares = 4323    Mean square = 865    n = 5 

Overall (Sum over all 5 folds) 
  ms 
2816 

Every time I repeat it I get this same ms=2816. Can someone please explain what exactly it means to repeat the CV 100 times? Because repeating this code 100 times doesn't seem to change the ms.

Upvotes: 0

Views: 664

Answers (1)

sconfluentus

Reputation: 4993

Repeating this code 100 times will not change anything. You have set a seed, so the data are split into the same three folds on every run, and all 100 runs give the same mean square error.
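To make "repeating" concrete, here is a minimal base-R sketch, not the DAAG implementation. `cv_ms` is a hypothetical helper, and `mtcars` (bundled with R) with the formula `mpg ~ wt + hp` is only a stand-in so the snippet runs without extra packages; its fold assignment will not match what `cv.lm` produces. It shows that a fixed seed reproduces the same mean square on every run, while a different seed per repetition gives a different split each time, and those results can then be averaged.

```r
# Hypothetical helper: k-fold CV mean square for a linear model, for one seed.
cv_ms <- function(formula, data, k = 3, seed = 1) {
  set.seed(seed)
  n <- nrow(data)
  # Randomly assign each row to one of k folds of (nearly) equal size.
  fold <- sample(rep(seq_len(k), length.out = n))
  response <- all.vars(formula)[1]
  sq_err <- numeric(n)
  for (i in seq_len(k)) {
    test <- fold == i
    # Fit on the other folds, predict the held-out fold.
    fit <- lm(formula, data = data[!test, , drop = FALSE])
    pred <- predict(fit, newdata = data[test, , drop = FALSE])
    sq_err[test] <- (data[[response]][test] - pred)^2
  }
  # Sum of squared CV residuals divided by n.
  mean(sq_err)
}

# Same seed every time: identical folds, identical result.
same_seed <- replicate(5, cv_ms(mpg ~ wt + hp, mtcars, k = 3, seed = 29))

# A different seed per repetition: 100 different splits.
diff_seed <- sapply(1:100, function(s) cv_ms(mpg ~ wt + hp, mtcars, k = 3, seed = s))

print(unique(same_seed))  # a single value
print(mean(diff_seed))    # repeated-CV estimate: average over the 100 repetitions
print(sd(diff_seed))      # how much the estimate moves from split to split
```

With DAAG installed, the same helper could presumably be called as `cv_ms(sale.price ~ area + bedrooms, DAAG::houseprices, k = 3, seed = s)`. Note that averaging over all seeds is not the same as picking the seed with the lowest error.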

It does not seem like you have enough samples for 50 or 100 folds to be appropriate. And there is NO set number of folds that is appropriate across all sets of data.

The number of folds should be reasonable such that you have sufficient testing data.
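As a rough illustration of that trade-off, the sketch below counts how many test rows each fold gets. It assumes n = 15, which is what the output in the question implies (3 folds of 5 observations each); `fold_sizes` is a hypothetical helper, not part of DAAG.

```r
# Assumed sample size: 3 folds x 5 test observations in the question's output.
n <- 15

# Hypothetical helper: number of test rows in each of k (nearly) equal folds.
fold_sizes <- function(n, k) {
  as.integer(table(rep(seq_len(k), length.out = n)))
}

sizes_3  <- fold_sizes(n, 3)   # 3 folds
sizes_5  <- fold_sizes(n, 5)   # 5 folds
sizes_15 <- fold_sizes(n, 15)  # leave-one-out

print(sizes_3)
print(sizes_5)
print(sizes_15)
```

More folds means fewer held-out rows per fold, so each fold's mean square rests on less test data.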

Also, you do not want to run multiple different CV models with different seeds, to try to find the best performing seed, because that form of error hacking is a proxy for overfitting.

You should groom your data well, engineer and transform your variables properly, pick a reasonable number of folds, set a seed so your stakeholders can repeat your findings, and then build your model.

Upvotes: 1
