Bob McBobson

Reputation: 904

How can I create multiple data models, then pick the model(s) that are evaluated to be the best fitting?

I have a train-test function set up in R that takes a set of data, excludes a certain portion of it (to prevent the model from overfitting the data), and then trains a linear model on about half of the remaining data before testing the model on the other half.

I should note that the data are PCA scores, which is why the linear model includes the seven principal components.

splitprob = 0.7
trainindex = createDataPartition(scores$y, p = splitprob, list = FALSE)
trainingset = scores[trainindex, ]
testset = scores[-trainindex, ]
model = glm(y ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7, data = trainingset)
summary(model)
prediction = predict(model, testset, se.fit = TRUE)

Now, what I want to do is run this script multiple times, produce multiple models, and then pick one or more of them to use for future predictions. I have set the function up to run a given number of times, but I do not know how to compare the resulting models against one another (probably by using the AIC), nor how to capture the parameters of each model so that I can export them to a text file or a .csv file.

I tried implementing the glmulti package, but due to various problems with Java, rJava, and Mac OS X, I have had massive problems getting it to install properly. Could anyone recommend another approach to this problem?

Upvotes: 0

Views: 307

Answers (1)

lampros

Reputation: 581

I have updated the answer so that a different set of predictors can be supplied on each call.

# build data

set.seed(1)
y = runif(100)
x = matrix(runif(1000), nrow = 100, ncol = 10)

A function for repeated train-test splits, which returns summary statistics over all iterations (it takes the response, the predictor matrix, the number of iterations, a formula, and the train-test split percentage):

glm_train_test = function(y, x, num_iters, Formula, split_percentage = 0.5) {

  aic_models = matrix(0, nrow = num_iters, ncol = 2)
  rmse_models = matrix(0, nrow = num_iters, ncol = 2)

  for (i in 1:num_iters) {
    # train-test split
    spl = sample(nrow(x), round(nrow(x) * split_percentage), replace = FALSE)
    train = data.frame(y = y[spl], x[spl, ])
    test = data.frame(y = y[-spl], x[-spl, ])

    tmp_formula = as.formula(Formula)
    fit = glm(formula = tmp_formula, data = train)
    pred_test = predict(fit, newdata = test)

    aic_models[i, ] = c(i, summary(fit)$aic)
    rmse_models[i, ] = c(i, Metrics::rmse(y[-spl], pred_test))   # requires the 'Metrics' package
  }

  # convert the resulting AIC / RMSE matrices to data frames
  sort_aic = as.data.frame(aic_models)
  colnames(sort_aic) = c('iteration', 'aic_value')

  sort_rmse = as.data.frame(rmse_models)
  colnames(sort_rmse) = c('iteration', 'rmse_value')

  tmp_aic = c(summary(sort_aic[, 2]), sd(sort_aic[, 2]))
  names(tmp_aic) = c(names(summary(sort_aic[, 2])), 'std.dev')

  tmp_rmse = c(summary(sort_rmse[, 2]), sd(sort_rmse[, 2]))
  names(tmp_rmse) = c(names(summary(sort_rmse[, 2])), 'std.dev')

  return(list(summary_aic = tmp_aic, summary_rmse = tmp_rmse))
}

First model, including only three predictors:

first_mod = glm_train_test(y, x, num_iters = 100, Formula = "y ~ X1 + X2 + X3", split_percentage = 0.5)

Example output:

first_mod

$summary_aic
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.   std.dev 
-6.193321 10.527612 13.317441 13.320968 17.019571 26.792904  5.798970 

$summary_rmse
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.    std.dev 
0.23862643 0.26628405 0.27568030 0.27713722 0.28616462 0.33223873 0.01730974 

Second model, which also includes an interaction:

second_model = glm_train_test(y, x, num_iters = 100, Formula = "y ~ X1 * X4 + X2 + X3", split_percentage = 0.5)

Example output for the second model:

second_model

$summary_aic
       Min.     1st Qu.      Median        Mean     3rd Qu.        Max.     std.dev 
-0.04232767 13.37572489 16.92720680 16.81206625 20.79921756 29.67830243  5.96477080 

$summary_rmse
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.    std.dev 
0.23912727 0.27026791 0.28117878 0.28177088 0.29526944 0.32674985 0.01819308

You can use whichever evaluation metric is appropriate for your data, and you may also want to look at the difference between AIC and BIC, which is discussed in another Stack Overflow question.
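As for exporting the model parameters, one option is to collect each run's AIC and coefficients into a data frame and write it out with `write.csv`. A minimal sketch in base R, using the built-in `mtcars` data as a stand-in for your PCA scores (the formula, number of iterations, and file name are illustrative):

```r
# Repeat a random train/fit cycle, record each model's AIC and coefficients,
# then export the whole table so the runs can be compared later.
set.seed(1)

runs = list()
for (i in 1:10) {
  idx = sample(nrow(mtcars), round(nrow(mtcars) * 0.5))
  fit = glm(mpg ~ wt + hp, data = mtcars[idx, ])
  runs[[i]] = data.frame(iteration = i, aic = AIC(fit), t(coef(fit)))
}

model_table = do.call(rbind, runs)               # one row per run
best_run = model_table[which.min(model_table$aic), ]   # lowest-AIC model

write.csv(model_table, "model_comparisons.csv", row.names = FALSE)
```

The same idea plugs straight into the `glm_train_test` function above: return the full `sort_aic` / `sort_rmse` data frames alongside the summaries and pass them to `write.csv`.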

Upvotes: 2
