Reputation: 358
Is there a simple command to do leave-one-out cross validation with the lm()
function in R?
Specifically, is there a simple command that does the equivalent of the code below?
x <- rnorm(1000, 3, 2)
y <- 2*x + rnorm(1000)
pred_error_sq <- 0
for(i in 1:1000) {
  x_i <- x[-i]
  y_i <- y[-i]
  mdl <- lm(y_i ~ x_i)                                # leave i'th observation out
  y_pred <- predict(mdl, data.frame(x_i = x[i]))      # predict i'th observation
  pred_error_sq <- pred_error_sq + (y[i] - y_pred)^2  # accumulate squared prediction errors
}
y_squared <- sum((y - mean(y))^2)            # variation of the data
R_squared <- 1 - (pred_error_sq/y_squared)   # measure of goodness of fit
Upvotes: 8
Views: 22787
Reputation: 31
Just write your own code: use an index variable to mark the one observation that is left out of the sample. I tested this method against the highest-voted one (which uses caret): although caret is simple and easy to use, my brute-force method takes less time. (I used LDA instead of lm, but there is no big difference.)
for (index in 1:nrow(df)) {
  # fit your lm here on df[-index, ], then predict df[index, ]
}
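A minimal sketch of that loop with lm() (the data frame `df` and its columns `x` and `y` are hypothetical stand-ins; substitute lda() or any other fitting function):

```r
# Manual LOOCV: hold out one row per iteration
df <- data.frame(x = rnorm(100, 3, 2))
df$y <- 2 * df$x + rnorm(100)

sq_err <- numeric(nrow(df))
for (index in 1:nrow(df)) {
  fit <- lm(y ~ x, data = df[-index, ])        # fit without row `index`
  pred <- predict(fit, newdata = df[index, ])  # predict the held-out row
  sq_err[index] <- (df$y[index] - pred)^2
}
loocv_mse <- mean(sq_err)  # leave-one-out estimate of prediction MSE
```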
Upvotes: 1
Reputation: 152
cv.glm in the boot package (https://www.rdocumentation.org/packages/boot/versions/1.3-20/topics/cv.glm) does LOOCV by default and only requires the data and a fitted glm object (a gaussian glm is equivalent to lm).
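For example (a sketch on simulated data matching the question; cv.glm's K argument defaults to the number of observations, which is exactly LOOCV):

```r
library(boot)  # provides cv.glm

x <- rnorm(1000, 3, 2)
y <- 2 * x + rnorm(1000)
dat <- data.frame(x, y)

fit <- glm(y ~ x, data = dat)  # gaussian glm, equivalent to lm(y ~ x)
cv  <- cv.glm(dat, fit)        # K defaults to n, i.e. LOOCV
cv$delta[1]                    # raw cross-validated estimate of prediction error (MSE)
```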
Upvotes: 1
Reputation: 8377
You can just use a custom function based on a statistical trick that avoids actually fitting all N models:
loocv <- function(fit) {
  h <- lm.influence(fit)$h
  mean((residuals(fit)/(1 - h))^2)
}
This is explained here: https://gerardnico.com/wiki/lang/r/cross_validation. It only works for linear models, and you may want to take a square root of the mean to report an RMSE instead of an MSE.
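A quick sketch checking the shortcut against an explicit leave-one-out loop (simulated data; for linear models the rescaled residuals e_i/(1 - h_i) are exactly the leave-one-out prediction errors, so the two estimates should agree):

```r
x <- rnorm(200, 3, 2)
y <- 2 * x + rnorm(200)
dat <- data.frame(x, y)
fit <- lm(y ~ x, data = dat)

# Shortcut: hat values (leverages) give the LOOCV MSE without refitting
loocv <- function(fit) {
  h <- lm.influence(fit)$h
  mean((residuals(fit) / (1 - h))^2)
}

# Explicit loop for comparison
errs <- sapply(seq_len(nrow(dat)), function(i) {
  m <- lm(y ~ x, data = dat[-i, ])
  dat$y[i] - predict(m, newdata = dat[i, ])
})

all.equal(loocv(fit), mean(errs^2))  # the two estimates agree
```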
Upvotes: 8
Reputation: 2114
Another solution is using caret:
library(caret)
x <- rnorm(1000, 3, 2)
data <- data.frame(x = x, y = 2*x + rnorm(1000))
train(y ~ x, method = "lm", data = data, trControl = trainControl(method = "LOOCV"))
Linear Regression

1000 samples
   1 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 999, 999, 999, 999, 999, 999, ...
Resampling results:

  RMSE      Rsquared  MAE
  1.050268  0.940619  0.836808

Tuning parameter 'intercept' was held constant at a value of TRUE
Upvotes: 13
Reputation: 996
You can try cv.lm from the DAAG package:
cv.lm(data = DAAG::houseprices, form.lm = formula(sale.price ~ area),
m = 3, dots = FALSE, seed = 29, plotit = c("Observed","Residual"),
main="Small symbols show cross-validation predicted values",
legend.pos="topleft", printit = TRUE)
Arguments
data        a data frame
form.lm     a formula, lm call, or lm object
m           the number of folds
dots        uses pch=16 for the plotting character
seed        random number generator seed
plotit      one of the text strings "Observed" or "Residual", or a logical value; TRUE is equivalent to "Observed", FALSE to "" (no plot)
main        main title for the graph
legend.pos  position of the legend: one of "bottomright", "bottom", "bottomleft", "left", "topleft", "top", "topright", "right", "center"
printit     if TRUE, output is printed to the screen
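Note that m = 3 in the call above is 3-fold cross-validation, not leave-one-out. A sketch of LOOCV with cv.lm, on the assumption that setting m to the number of rows makes each fold a single observation:

```r
library(DAAG)

# LOOCV with cv.lm: one fold per row
out <- cv.lm(data = DAAG::houseprices,
             form.lm = formula(sale.price ~ area),
             m = nrow(DAAG::houseprices),  # each fold holds out one observation
             plotit = FALSE, printit = FALSE)

attr(out, "ms")  # cross-validated mean squared error (per the DAAG docs)
```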
Upvotes: 2