user1407875

Reputation: 69

Can I do predict.glmnet on test data with different number of predictor variables?

I used glmnet to build a predictive model on a training set with ~200 predictors and 100 samples, for a binomial regression/classification problem.

I selected the best model (16 predictors) that gave me the max AUC. I have an independent test set with only those variables (16 predictors) which made it into the final model from the training set.

Is there any way to use predict.glmnet with the optimal model from the training set on a new test set that contains data only for the variables that made it into the final model?

Upvotes: 5

Views: 7812

Answers (1)

NiuBiBang

Reputation: 628

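When predicting, glmnet requires the validation/test matrix to have exactly the same number of columns as the training matrix. As far as I know the columns are matched by position rather than by name, so they should also be in the same order as in training. For example: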

library(caret)
library(glmnet)
df <- ... # a dataframe with 200 variables, some of which you want to predict on 
      #  & some of which you don't care about.
      # Variable 13 ('Response.Variable') is the dependent variable.
      # Variables 1-12 & 14-113 are the predictor variables
      # All training/testing & validation datasets are derived from this single df.

# Split dataframe into training & testing sets
inTrain <- createDataPartition(df$Response.Variable, p = .75, list = FALSE)
Train <- df[ inTrain, ] # Training dataset for all model development
Test <- df[ -inTrain, ] # Final sample for model validation

# Run cross-validated logistic regression, using only the specified predictor variables
logCV <- cv.glmnet(x = data.matrix(Train[, c(1:12, 14:113)]), y = Train[, 13],
                   family = 'binomial', type.measure = 'auc')

# Test model over final test set, using the same predictor columns as in training
# Create field in dataset that contains predicted probabilities
# (predict() returns a one-column matrix, so convert it to a plain vector)
Test$prob <- as.numeric(predict(logCV, type = "response",
                                newx = data.matrix(Test[, c(1:12, 14:113)]),
                                s = 'lambda.min'))

For a completely new set of data, you could constrain the new df to the necessary variables using some variant of the following method:

new.df <- ... # new df w/ 1,000 variables, which include all predictor variables used 
              # in developing the model

# Create object with requisite predictor variable names that we specified in the model,
# listed in the same order as the columns of the training matrix
predictvars <- c('PredictorVar1', 'PredictorVar2', 'PredictorVar3', 
                  ... 'PredictorVarK')
# Select the columns by name: this limits the new df of 1,000 variables to the
# requisite variables *in the training order*. Subsetting with
# names(new.df) %in% predictvars would keep new.df's own column order instead.
new.df$prob <- as.numeric(predict(logCV, type = "response",
                                  newx = data.matrix(new.df[, predictvars]),
                                  s = 'lambda.min'))
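
For the situation in the question (a test set that holds only the 16 selected predictors), one possible workaround is to pad the test data with zero-filled columns for the predictors that were dropped. This is only a sketch: pad_newx is a hypothetical helper, not part of glmnet, and the approach is only valid if every missing predictor has a coefficient of exactly zero at the chosen lambda and the training matrix had column names.

```r
# Hypothetical helper (not part of glmnet): build a matrix with all training
# columns, in training order, filling predictors absent from `newdata` with 0.
# Columns of `newdata` that were not used in training are ignored.
pad_newx <- function(newdata, train_cols) {
  newx <- matrix(0, nrow = nrow(newdata), ncol = length(train_cols),
                 dimnames = list(NULL, train_cols))
  present <- intersect(train_cols, colnames(newdata))
  newx[, present] <- data.matrix(newdata[, present, drop = FALSE])
  newx
}

# Toy illustration: training used x1..x5, the new data has only x4 and x2
train_cols <- paste0("x", 1:5)
small <- data.frame(x4 = c(0.5, 1.5), x2 = c(10, 20))
newx <- pad_newx(small, train_cols)

# With a fitted model this might be used as follows (test16 is an assumed name
# for the 16-variable test set):
# train_cols <- rownames(coef(logCV, s = "lambda.min"))[-1]  # drop "(Intercept)"
# prob <- predict(logCV, newx = pad_newx(test16, train_cols),
#                 s = "lambda.min", type = "response")
```

Because a zero coefficient contributes nothing to the linear predictor, the value used to fill the dropped columns does not matter; zero is just the simplest choice.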

Additionally, glmnet only deals with matrices. This is probably why you're getting the error you posted in the comment to your question. Some users (myself included) have found that as.matrix() doesn't resolve the issue: if the data frame has any non-numeric column (a factor or character variable, say), as.matrix() returns a character matrix. data.matrix() converts such columns to numbers and seems to work, which is why it's in the above code. This issue is addressed in a thread or two on SO.
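
To make the as.matrix() vs. data.matrix() difference concrete, here is a small base R sketch on a made-up data frame (the column names are purely illustrative):

```r
# Toy data frame with one numeric and one factor column
d <- data.frame(age = c(31, 45, 52),
                group = factor(c("a", "b", "a")))

m_as   <- as.matrix(d)    # a non-numeric column is present -> character matrix
m_data <- data.matrix(d)  # factors are replaced by their integer codes -> numeric matrix

is.numeric(m_as)    # FALSE
is.numeric(m_data)  # TRUE
m_data[, "group"]   # 1 2 1
```

Note that data.matrix() codes a factor as a single integer column. If you want dummy variables instead, model.matrix() is the usual base R tool, but that is a different encoding from the one used in the code above.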

I assume that all variables in the new dataset to be predicted also need to be formatted the same way (same types, same factor levels) as they were in the dataset used for model development. I usually pull all of my data from the same source, so I haven't encountered what glmnet will do in cases where the formatting is different.
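
One place where formatting differences could plausibly bite is factor columns, since data.matrix() turns them into integer codes based on each factor's own levels. The toy base R sketch below (made-up values, not from the question) shows how the codes can disagree between two datasets and one way to align them:

```r
train_grp <- factor(c("low", "mid", "high", "mid"))  # levels: high, low, mid
new_grp   <- factor(c("mid", "low"))                 # levels: low, mid

as.integer(train_grp)  # 2 3 1 3 -> "mid" was coded 3 in training
as.integer(new_grp)    # 2 1     -> but "mid" is coded 2 in the new data

# Re-create the factor with the training levels so the codes agree
new_aligned <- factor(as.character(new_grp), levels = levels(train_grp))
as.integer(new_aligned)  # 3 2
```

Any value in the new data that is not among the training levels becomes NA after this step, which at least makes the mismatch visible.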

Upvotes: 3
