Reputation: 41
I am trying to develop a function that does k-fold cross-validation for any linear regression model. I was able to develop the code below:
K.CV.2 <- function(numberOfFolds, inputData){
  index <- sample(1:numberOfFolds, nrow(inputData), replace = T)
  inputData$index <- index
  mse.all <- c()
  for (n in 1:numberOfFolds) {
    training.data <- inputData[inputData$index != n, ]
    testing.data <- inputData[inputData$index == n, ]
    model.hold <- lm(earn ~ height + sex + race + ed, data = training.data)
    predictions <- predict(model.hold, newdata = testing.data)
    mse.hold <- mean((testing.data$earn - predictions)^2)
    mse.all <- c(mse.all, mse.hold)
  }
  return(mse.all)
}

K.CV.2(numberOfFolds, inputData)
I am not satisfied with this function, as I am trying to make it more generic. I would like something like function(numberOfFolds, y, x, inputData): I want to be able to pass the column of the y variable, and all the other columns that I will use as explanatory variables in x, using the c() function.
Can someone please help/guide me in the right direction?
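For example, I would like to be able to call it like this (just an illustration of the interface I want, using the columns from the model above):

## desired interface -- illustration only, not working code
K.CV.2(numberOfFolds = 10,
       y = "earn",
       x = c("height", "sex", "race", "ed"),
       inputData = inputData)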
Upvotes: 0
Views: 497
Reputation: 3038
Something like this, I guess. Some notes:

- Don't use T. It's not a reserved value: you could write T <- FALSE in your script with predictably hilarious results.
- Pass in the x and y you want to use; that's what I've done here. You could then loop through a dataframe and pass in a subsetted x and whatever y you wanted, but it keeps the CV function cleaner.
- Avoid 1:n in code. If n is zero, it creates the vector c(1, 0), which often leads to unexpected results. (Both pitfalls are demonstrated in the snippet below.)
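A minimal sketch of both pitfalls at the console:

n <- 0
1:n           ## c(1, 0) -- a loop over this runs twice instead of zero times
seq_len(n)    ## integer(0) -- an empty sequence, as intended

T <- FALSE    ## perfectly legal: T is an ordinary variable, only TRUE is reserved
isTRUE(T)     ## FALSE -- any later code relying on T being TRUE now misbehaves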
## simulate example data: 10 predictors with a true linear relationship
npred <- 10
nobs <- 100
mat <- matrix(rnorm(npred * nobs), ncol = npred, nrow = nobs)
beta <- rnorm(npred, sd = 2)
y <- (mat %*% beta + rnorm(nobs, sd = 0.2))[, 1]
x <- as.data.frame(mat)
cv_lm <- function(x, y, nfolds) {
  ## don't use T; it's dangerous
  index <- sample(seq_len(nfolds), nrow(x), replace = TRUE)
  stopifnot(nrow(x) == length(y)) ## check to make sure the input is valid
  ## declare output as a vector of length nfolds
  mse <- numeric(nfolds)
  ## use seq_len(nfolds) instead of 1:nfolds
  for (n in seq_len(nfolds)) {
    ## create x and y training/test split
    x_training <- x[index != n, ]
    x_testing <- x[index == n, ]
    y_training <- y[index != n]
    y_testing <- y[index == n]
    ## fit model (y ~ . means "y against everything")
    fit <- lm(y_training ~ ., data = x_training)
    predictions <- predict(fit, newdata = x_testing)
    ## don't grow vectors
    mse[[n]] <- mean((y_testing - predictions)^2)
  }
  return(mse)
}
## predict y with x (true predictors)
cv_lm(x = x, y = y, nfolds = 10)
#> [1] 0.07503477 0.02913210 0.03467926 0.03762000 0.03291574 0.05687162
#> [7] 0.01023195 0.06304727 0.02435436 0.06634923
## predict first predictor in x with the other 9 (no true relationship, hence larger MSE)
cv_lm(x = x[, 2:10], y = x[, 1], nfolds = 10)
#> [1] 0.7447689 1.4850093 1.0964575 2.5117445 1.1037217 1.0684408 1.7798122
#> [8] 0.9962255 1.0208430 1.3867716
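If you want the question's interface (picking the response and the explanatory variables by column name with c()), one way is a thin wrapper around cv_lm. This is just a sketch; earn, height, etc. are placeholder column names taken from the question's model, and yourData is a placeholder dataframe:

## hypothetical wrapper: select columns by name from a dataframe
cv_lm_by_name <- function(nfolds, y, x, inputData) {
  cv_lm(x = inputData[, x, drop = FALSE],  ## keep a data frame even for one predictor
        y = inputData[[y]],
        nfolds = nfolds)
}

## e.g. cv_lm_by_name(10, y = "earn", x = c("height", "sex", "race", "ed"), inputData = yourData)

Note that with factor predictors a test fold can contain a level the training split never saw, which makes predict() fail; the sketch doesn't guard against that.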
Upvotes: 1