pdhami

Reputation: 197

Prediction on new data with GLMNET and CARET - The number of variables in newx must be X

I have a dataset on which I am running k-fold cross-validation.

In each fold, I split the data into a training set and a test set.

To train on the dataset X, I run the following code:

cv_glmnet <- caret::train(x = as.data.frame(X[curtrainfoldi, ]), y = y[curtrainfoldi, ],
                       method = "glmnet",
                       preProcess = NULL,
                       trControl = trainControl(method = "cv", number = 10),
                       tuneLength = 10)
    
   

I check the class of 'cv_glmnet', and 'train' is returned.

I then want to use this model to predict values for the test dataset, which is a matrix with the same number of variables (columns):

# predicting on test data 
yhat <- predict.train(cv_glmnet, newdata = X[curtestfoldi, ])   

However, I keep running into the following error:

Error in predict.glmnet(modelFit, newdata, s = modelFit$lambdaOpt, type = "response") : 
  The number of variables in newx must be 210

The caret documentation for predict.train says the following about newdata:

newdata: an optional set of data to predict on. If NULL, then the original training data are used but, if the train model used a recipe, an error will occur.

I am confused as to why I am running into this error. Is it related to how I am defining newdata? My test data has the same number of variables (columns) as the training data, so I have no idea what is causing the error.

Upvotes: 1

Views: 3515

Answers (1)

StupidWolf

Reputation: 46978

You get the error because the column names change when you pass as.data.frame(X). If your matrix doesn't have column names, as.data.frame creates them, and the model expects those names when it tries to predict. If the matrix already has column names, some of them could be changed:

library(caret)

# Simulated data: a 50 x 20 matrix without column names
X <- matrix(runif(50 * 20), ncol = 20)
y <- rnorm(50)

cv_glmnet <- caret::train(x = as.data.frame(X), y = y,
                       method = "glmnet",
                       preProcess = NULL,
                       trControl = trainControl(method = "cv", number = 10),
                       tuneLength = 10)

yhat <- predict.train(cv_glmnet, newdata = X) 

Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.
Error in predict.glmnet(modelFit, newdata, s = modelFit$lambdaOpt) : 
  The number of variables in newx must be 20 
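
The names that as.data.frame creates can be inspected directly. The following is a minimal sketch in base R (caret is not needed for it), using a matrix of the same shape as the one above; X_df is just an illustrative variable name:

```r
# Unnamed 50 x 20 matrix, as in the example above
X <- matrix(runif(50 * 20), ncol = 20)

# The matrix itself carries no column names
is.null(colnames(X))

# as.data.frame() generates the names V1, V2, ..., V20
X_df <- as.data.frame(X)
head(names(X_df))
```

So the model is trained on columns called V1 to V20, while the raw matrix passed as newdata has no names to match against them.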

If the matrix has column names, it works:

colnames(X) = paste0("column",1:ncol(X))
cv_glmnet <- caret::train(x = as.data.frame(X), y = y,
                       method = "glmnet",
                       preProcess = NULL,
                       trControl = trainControl(method = "cv", number = 10),
                       tuneLength = 10)

yhat <- predict.train(cv_glmnet, newdata = X)
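
Another way to guard against the mismatch is to convert the test rows the same way as the training rows and compare names before predicting. This is only a sketch in base R: check_newdata_names is a hypothetical helper (not part of caret), the 40/10 row split is made up for illustration, and it assumes the model matches newdata columns by name.

```r
# Hypothetical helper (not a caret function): convert newdata like the
# training data and stop early if any training column is missing
check_newdata_names <- function(train_df, newdata) {
  newdata <- as.data.frame(newdata)
  missing_cols <- setdiff(names(train_df), names(newdata))
  if (length(missing_cols) > 0) {
    stop("newdata is missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # Return the columns in the same order as the training data
  newdata[, names(train_df), drop = FALSE]
}

# Illustrative split: first 40 rows for training, last 10 for testing
X <- matrix(runif(50 * 20), ncol = 20)
train_df <- as.data.frame(X[1:40, ])
test_df <- check_newdata_names(train_df, X[41:50, ])

# Both data frames now carry the same column names
identical(names(train_df), names(test_df))
```

A data frame prepared like test_df could then be passed as newdata instead of the raw matrix; I have not run that step against the model here.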

Upvotes: 2
