stollenm

Reputation: 358

Is there a simple command to do leave-one-out cross-validation with the lm() function in R?

Specifically, is there a single command that replaces the code below?

x <- rnorm(1000,3,2)
y <- 2*x + rnorm(1000)

pred_error_sq <- 0
for(i in 1:1000) {
  x_i <- x[-i]
  y_i <- y[-i]
  mdl <- lm(y_i ~ x_i) # fit with the i'th observation left out
  y_pred <- predict(mdl, data.frame(x_i = x[i])) # predict the i'th observation
  pred_error_sq <- pred_error_sq + (y[i] - y_pred)^2 # accumulate squared prediction errors
}

y_squared <- sum((y - mean(y))^2) # total variation of the data (sum of squares)

R_squared <- 1 - (pred_error_sq/y_squared) # measure of goodness of fit

Upvotes: 8

Views: 22787

Answers (5)

pockeystar

Reputation: 31

Just write your own code, using an index variable to mark the one observation that is left out of the sample. I tested this approach against the top-voted answer that uses caret: although caret is simple and easy to use, my brute-force method takes less time. (I used LDA instead of lm, but it makes no big difference.)

for (index in 1:nrow(df)) {
   # fit the model on every row except `index`, then predict the held-out row
   fit <- lm(y ~ x, data = df[-index, ])
   pred <- predict(fit, newdata = df[index, ])
}

Upvotes: 1

dgisser

Reputation: 152

cv.glm from the boot package (https://www.rdocumentation.org/packages/boot/versions/1.3-20/topics/cv.glm) performs LOOCV by default and only requires the data and a fitted glm object.
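A minimal sketch of that approach on the question's simulated data (the variable names here are my own):

```r
library(boot)

set.seed(42)
x <- rnorm(1000, 3, 2)
df <- data.frame(x = x, y = 2*x + rnorm(1000))

# cv.glm needs a glm object; with the default gaussian family this fits
# the same model that lm() would
fit <- glm(y ~ x, data = df)

# K defaults to the number of observations, i.e. leave-one-out
cv_err <- cv.glm(df, fit)
cv_err$delta[1]  # raw LOOCV estimate of the mean squared prediction error
```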

Upvotes: 1

agenis

Reputation: 8377

You can just use a custom function using a statistical trick that avoids actually computing all the N models:

loocv <- function(fit) {
  h <- lm.influence(fit)$h           # leverage values (diagonal of the hat matrix)
  mean((residuals(fit) / (1 - h))^2) # LOOCV mean squared prediction error
}

This is explained here: https://gerardnico.com/wiki/lang/r/cross_validation. It only works for linear models, and you might want to take a square root after the mean in the formula to get an RMSE instead of an MSE.
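To convince yourself the shortcut matches brute-force LOOCV, you can compare the two on simulated data (a quick check of my own, not from the linked page):

```r
set.seed(1)
x <- rnorm(100, 3, 2)
y <- 2*x + rnorm(100)
fit <- lm(y ~ x)

# the shortcut: leave-one-out residuals are e_i / (1 - h_i)
loocv <- function(fit) {
  h <- lm.influence(fit)$h
  mean((residuals(fit) / (1 - h))^2)
}

# brute-force LOOCV for comparison: refit n times
errs <- sapply(seq_along(y), function(i) {
  m <- lm(y ~ x, data = data.frame(x = x[-i], y = y[-i]))
  y[i] - predict(m, data.frame(x = x[i]))
})

all.equal(loocv(fit), mean(errs^2))  # TRUE
```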

Upvotes: 8

amarchin

Reputation: 2114

Another solution is to use caret:

library(caret)

x <- rnorm(1000, 3, 2)
data <- data.frame(x = x, y = 2*x + rnorm(1000)) # y = 2*x cannot see the x column inside the data.frame() call, so define x first

train(y ~ x, method = "lm", data = data, trControl = trainControl(method = "LOOCV"))

Linear Regression

1000 samples
   1 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 999, 999, 999, 999, 999, 999, ...
Resampling results:

  RMSE      Rsquared  MAE
  1.050268  0.940619  0.836808

Tuning parameter 'intercept' was held constant at a value of TRUE
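If you want the resampled metrics programmatically rather than printed, train() returns an object whose results element holds them (a small sketch; the name model is my own):

```r
library(caret)

set.seed(1)
x <- rnorm(1000, 3, 2)
data <- data.frame(x = x, y = 2*x + rnorm(1000))

model <- train(y ~ x, method = "lm", data = data,
               trControl = trainControl(method = "LOOCV"))
model$results$RMSE  # LOOCV root mean squared error
```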

Upvotes: 13

Ale

Reputation: 996

You can try cv.lm from the DAAG package:

cv.lm(data = DAAG::houseprices, form.lm = formula(sale.price ~ area),
              m = 3, dots = FALSE, seed = 29, plotit = c("Observed","Residual"),
              main="Small symbols show cross-validation predicted values",
              legend.pos="topleft", printit = TRUE)

Arguments

data        a data frame
form.lm     a formula, an lm call, or an lm object
m           the number of folds
dots        uses pch=16 for the plotting character
seed        random number generator seed
plotit      one of the text strings "Observed" or "Residual", or a logical value; TRUE is equivalent to "Observed", FALSE to "" (no plot)
main        main title for the graph
legend.pos  position of the legend: one of "bottomright", "bottom", "bottomleft", "left", "topleft", "top", "topright", "right", "center"
printit     if TRUE, output is printed to the screen
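Note that cv.lm does m-fold cross-validation, so for leave-one-out specifically you can set m to the number of rows, giving one fold per observation (a sketch under that assumption):

```r
library(DAAG)

n <- nrow(DAAG::houseprices)  # one fold per observation = LOOCV
cv.lm(data = DAAG::houseprices, form.lm = formula(sale.price ~ area),
      m = n, plotit = FALSE, printit = FALSE)
```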

Upvotes: 2
