Tareq

Reputation: 41

How can I make a function that does K-fold cross validation for linear models?

I am trying to develop a function that performs k-fold cross validation for any linear regression model. I was able to develop the code below:

K.CV.2<-function(numberOfFolds, inputData){
  index <- sample(1:numberOfFolds, nrow(inputData), replace = T)
  inputData$index <- index
  mse.all <- c()
  for (n in 1:numberOfFolds) {
    training.data <- inputData[inputData$index!=n,]
    testing.data <- inputData[inputData$index==n,]
    model.hold <- lm(earn~height+sex+race+ed, data = training.data)
    predictions <- predict(model.hold, newdata=testing.data)
    mse.hold <- mean((testing.data$earn-predictions)^2)
    mse.all <- c(mse.all,mse.hold)
  }
  return(mse.all)
}

K.CV.2(numberOfFolds, inputData)

I am not satisfied with this function because I would like it to be more generic, something like function(numberOfFolds, y, x, inputData).

I want to be able to pass the column of the y variable, and all of the other columns that I will use as explanatory variables in x, using the c() function.

Can someone please help/guide me in the right direction?

Upvotes: 0

Views: 497

Answers (1)

alan ocallaghan

Reputation: 3038

Something like this I guess. Some notes:

  • It's easier to help when you provide example data. I've generated some fake data for this case, but it's something to keep in mind in future. Generally, I strongly advise you to practice simulating data to test your models and assumptions.
  • Don't use T. It's not a reserved value: you could write T <- FALSE in your script, with predictably hilarious results (see the short illustration after this list).
  • It would be easier to simply pass in the x and y you want to use, which is what I've done here. You could then loop through a data frame and pass in a subsetted x and whatever y you wanted (see the wrapper at the end of this answer), but it keeps the CV function cleaner.
  • Don't grow objects unless absolutely necessary. See The R Inferno book.
  • It's good practice to avoid 1:n in code: if n is zero, it creates the vector c(1, 0), which often leads to unexpected results.
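To make the T and 1:n points concrete, here is a short illustration you can run in a fresh session (the sample() call is just an arbitrary example that happens to hit the pitfall):

## T is an ordinary binding to TRUE, not a reserved word, so it can be shadowed:
T <- FALSE
sample(1:5, 10, replace = T)
#> Error: cannot take a sample larger than the population when 'replace = FALSE'
rm(T) ## remove the shadowing binding so T finds base TRUE again

## 1:n counts backwards when n is 0; seq_len() handles the empty case correctly:
1:0        #> [1] 1 0
seq_len(0) #> integer(0)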
First, simulate some fake data with a known linear relationship:

npred <- 10
nobs <- 100
mat <- matrix(rnorm(npred * nobs), ncol = npred, nrow = nobs)
beta <- rnorm(npred, sd = 2)
## response is a linear combination of the predictors plus a little noise
y <- (mat %*% beta + rnorm(nobs, sd = 0.2))[, 1]
x <- as.data.frame(mat)

cv_lm <- function(x, y, nfolds) {
  stopifnot(nrow(x) == length(y)) ## check the input is valid before doing anything
  ## don't use T; it's dangerous.
  index <- sample(seq_len(nfolds), nrow(x), replace = TRUE)

  ## declare output as a vector of length nfolds
  mse <- numeric(nfolds)
  ## use seq_len(nfolds) instead of 1:nfolds
  for (n in seq_len(nfolds)) {
    ## create x and y training/test split
    x_training <- x[index != n, ]
    x_testing <- x[index == n, ]
    y_training <- y[index != n]
    y_testing <- y[index == n]
    ## fit model (y ~ . means "y against everything")
    fit <- lm(y_training ~ ., data = x_training)
    predictions <- predict(fit, newdata = x_testing)
    ## don't grow vectors
    mse[[n]] <- mean((y_testing - predictions)^2)
  }
  return(mse)
}

## predict y with x (true predictors)
cv_lm(x = x, y = y, nfolds = 10)
#>  [1] 0.07503477 0.02913210 0.03467926 0.03762000 0.03291574 0.05687162
#>  [7] 0.01023195 0.06304727 0.02435436 0.06634923
## predict first predictor in x with the other 9 (no true relationship, hence larger MSE)
cv_lm(x = x[, 2:10], y = x[, 1], nfolds = 10)
#>  [1] 0.7447689 1.4850093 1.0964575 2.5117445 1.1037217 1.0684408 1.7798122
#>  [8] 0.9962255 1.0208430 1.3867716
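If you want the interface you described in the question (pass a data frame plus the names of the response and predictor columns), a thin wrapper around cv_lm will do it. This is just a sketch: the column names below are the ones from your earn example, so swap in your own.

## hypothetical wrapper: select columns by name, then reuse cv_lm
cv_lm_df <- function(data, yname, xnames, nfolds) {
  cv_lm(x = data[, xnames, drop = FALSE], y = data[[yname]], nfolds = nfolds)
}
## e.g. cv_lm_df(inputData, "earn", c("height", "sex", "race", "ed"), nfolds = 10)

Averaging the per-fold errors, e.g. mean(cv_lm(x = x, y = y, nfolds = 10)), gives a single cross-validated MSE estimate for the model.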

Upvotes: 1
