xmisx

Reputation: 55

Implementing Monte Carlo Cross Validation on linear regression in R

I have a dataset of 90 stations with a variety of covariates that I would like to use for prediction via step-wise forward multiple regression. I would therefore like to use Monte Carlo cross-validation (MCCV) to estimate the performance of my linear model by repeatedly splitting the data into training and test sets. How can I implement MCCV in R for a given number of iterations? I found the WilcoxCV package, which gives me the observation numbers for each iteration, and the CMA package, which hasn't helped me much so far. I checked all the threads about MCCV but didn't find an answer.
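To make it concrete, this is the kind of repeated random splitting I mean, sketched in base R on synthetic data (the 80/20 split ratio, 100 iterations, and the toy covariates are arbitrary choices, not my real data):

```r
# Monte Carlo cross-validation by hand: repeatedly draw a random
# train/test split, fit on the training part, score on the test part.
set.seed(1)
n <- 90                                  # e.g. 90 stations
dat <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))

n_iter <- 100                            # number of random splits
test_rmse <- numeric(n_iter)
for (i in seq_len(n_iter)) {
  train_idx <- sample(n, size = round(0.8 * n))  # 80/20 split
  fit  <- lm(y ~ ., data = dat[train_idx, ])
  pred <- predict(fit, newdata = dat[-train_idx, ])
  test_rmse[i] <- sqrt(mean((dat$y[-train_idx] - pred)^2))
}
mean(test_rmse)   # MCCV estimate of prediction error
```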

Upvotes: 0

Views: 2619

Answers (1)

while

Reputation: 3772

You can use the caret package. MCCV is called 'LGOCV' in this package (i.e. Leave-Group-Out CV). It repeatedly draws random splits of the data into training and test sets.
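Under the hood this resampling is just repeated random partitioning, which you can see directly with caret's createDataPartition (toy outcome vector; p and times here are illustrative values):

```r
library(caret)

# LGOCV-style resampling: 'times' independent random splits, each
# keeping a proportion p of the observations for training.
set.seed(42)
y <- rnorm(30)
splits <- createDataPartition(y, p = 0.8, times = 5)
str(splits, max.level = 1)  # five vectors of training-row indices
```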

Here is an example training an L1-regularized regression model (you should look into regularization instead of step-wise selection, by the way), using MCCV to validate the choice of the penalizing lambda parameter:

library(caret)
library(glmnet)

n <- 1000 # nbr of observations
m <- 20   # nbr of features

# Generate example data
x <- matrix(rnorm(m*n),n,m)
colnames(x) <- paste0("var",1:m)
y <- rnorm(n)
dat <- as.data.frame(cbind(y,x))

# Set up training settings object
trControl <- trainControl(method = "LGOCV", # Leave Group Out CV (MCCV)
                          number = 10)      # Number of random train/test splits

# Set up grid of parameters to test
params <- expand.grid(alpha = c(0, 0.5, 1),              # L1/L2 mixing parameter
                      lambda = 2^seq(1, -10, by = -0.3)) # regularization parameter

# Run training over tuneGrid and select best model
glmnet.obj <- train(y ~ .,                 # model formula (. means all features)
                    data = dat,            # data.frame containing training set
                    method = "glmnet",     # model to use
                    trControl = trControl, # set training settings
                    tuneGrid = params)     # set grid of params to test over

# Plot performance for different params
plot(glmnet.obj, xTrans=log, xlab="log(lambda)")

# Plot regularization paths for the best model
plot(glmnet.obj$finalModel, xvar="lambda", label=TRUE)

You can use glmnet to train linear models. If you want step-wise selection, caret supports that too, e.g. via method = 'glmStepAIC' or similar.
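A minimal sketch of that, assuming the same LGOCV resampling as above (synthetic data; whether extra arguments such as trace are forwarded to MASS::stepAIC is an assumption worth checking in your caret version):

```r
library(caret)

# Hypothetical sketch: stepwise selection (MASS::stepAIC under the
# hood) validated with MCCV/LGOCV resampling.
set.seed(2)
dat <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))

step.obj <- train(y ~ ., data = dat,
                  method    = "glmStepAIC",
                  trace     = 0,   # assumed to silence stepAIC output
                  trControl = trainControl(method = "LGOCV", number = 10))

step.obj$finalModel  # the selected model
```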

A list of the feature selection wrappers can be found here: http://topepo.github.io/caret/Feature_Selection_Wrapper.html

Edit

The alpha and lambda arguments in the expand.grid call are glmnet-specific parameters. If you use another model, it will have a different set of parameters to optimize over.

lambda is the amount of regularization, i.e. the amount of penalization on the beta values. Larger values give "simpler" models that are less prone to overfitting; smaller values give more complex models that will tend to overfit if not enough data is available. The lambda values I supplied are just an example; supply the grid you are interested in. In general, it is good to supply an exponentially decreasing sequence for lambda.

alpha is the mixing parameter between L1 and L2 regularization: alpha=1 is pure L1 and alpha=0 is pure L2 regularization. The grid above supplies three values, alpha=c(0,0.5,1), which tests L2, an even mix of the two, and L1.

expand.grid creates a grid of candidate parameter values to run the MCCV procedure over. Essentially, the MCCV procedure evaluates performance for each combination in the grid and selects the best one for you.
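For example, once train() finishes you can inspect which grid point MCCV selected and the resampled performance behind that choice (a hypothetical minimal run; the single alpha and short lambda sequence are just to keep it quick):

```r
library(caret)
library(glmnet)

set.seed(3)
dat <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))

obj <- train(y ~ ., data = dat, method = "glmnet",
             trControl = trainControl(method = "LGOCV", number = 5),
             tuneGrid  = expand.grid(alpha = 1, lambda = 2^seq(0, -5)))

obj$bestTune       # winning (alpha, lambda) combination
head(obj$results)  # resampled performance per grid point
```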

You can read more about glmnet, caret and parameter tuning in their respective documentation.

Upvotes: 3
