Reputation: 197
I have a dataset on which I am doing k-fold cross-validation.
In each fold, I split the data into a train and a test dataset.
For the training on the dataset X, I run the following code:
cv_glmnet <- caret::train(x = as.data.frame(X[curtrainfoldi, ]), y = y[curtrainfoldi, ],
                          method = "glmnet",
                          preProcess = NULL,
                          trControl = trainControl(method = "cv", number = 10),
                          tuneLength = 10)
I check the class of 'cv_glmnet', and 'train' is returned.
I then want to use this model to predict values in the test dataset, which is a matrix that has the same number of variables (columns) as the training data:
# predicting on test data
yhat <- predict.train(cv_glmnet, newdata = X[curtestfoldi, ])
However, I keep running into the following error:
Error in predict.glmnet(modelFit, newdata, s = modelFit$lambdaOpt, type = "response") :
The number of variables in newx must be 210
I noticed that the caret documentation for predict.train states the following:
newdata: an optional set of data to predict on. If NULL, then the original training data are used but, if the train model used a recipe, an error will occur.
I am confused as to why I am running into this error. Is it related to how I am defining newdata? My test data has the right number of variables/columns (same as the train dataset), so I have no idea what is causing the error.
Upvotes: 1
Views: 3515
Reputation: 46978
You get the error because the column names change when you pass as.data.frame(X). If your matrix doesn't have column names, as.data.frame() creates them, and the model expects these names when it tries to predict. If it has column names, some of them could be changed:
library(caret)
X = matrix(runif(50*20), ncol=20)  # matrix without column names
y = rnorm(50)
cv_glmnet <- caret::train(x = as.data.frame(X), y = y,
                          method = "glmnet",
                          preProcess = NULL,
                          trControl = trainControl(method = "cv", number = 10),
                          tuneLength = 10)
# newdata is the bare matrix, so its column names do not match the ones used in training
yhat <- predict.train(cv_glmnet, newdata = X)
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
Error in predict.glmnet(modelFit, newdata, s = modelFit$lambdaOpt) :
The number of variables in newx must be 20
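To see the mismatch directly, here is a minimal base-R sketch (no caret needed). It compares the names the model was trained on with the names of the bare matrix passed as newdata. The variable names `train_names`, `new_names` and `n_matched` are only illustrative. That caret selects the newdata columns by name is my assumption about why glmnet then complains about the number of variables.

```r
# Base R only: compare training column names with those of the bare matrix.
set.seed(1)
X <- matrix(runif(50 * 20), ncol = 20)   # matrix with no column names

train_names <- names(as.data.frame(X))   # auto-generated: "V1", "V2", ...
new_names   <- colnames(X)               # NULL for the bare matrix

# Count how many columns of the bare matrix can be matched by name
# to a training column (assumption: matching is done by name).
n_matched <- sum(new_names %in% train_names)

print(head(train_names, 3))
print(is.null(new_names))
print(n_matched)
```

If no columns can be matched by name, the matrix handed to glmnet would not have the 20 variables it was fitted on, which is consistent with the error message.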
If the matrix has column names, prediction works:
colnames(X) = paste0("column",1:ncol(X))
cv_glmnet <- caret::train(x = as.data.frame(X), y = y,
                          method = "glmnet",
                          preProcess = NULL,
                          trControl = trainControl(method = "cv", number = 10),
                          tuneLength = 10)
yhat <- predict.train(cv_glmnet, newdata = X)
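For the fold setup in the question, an alternative is to convert the matrix to a data frame once and subset that for both training and prediction, so both sides share the same column names. This is only a sketch. `train_idx` and `test_idx` are hypothetical stand-ins for curtrainfoldi and curtestfoldi, the data are simulated, and the smaller `number` and `tuneLength` values are there just to keep the example fast.

```r
library(caret)   # method = "glmnet" also needs the glmnet package installed

set.seed(42)
X <- matrix(runif(50 * 20), ncol = 20)   # simulated predictors, no column names
y <- rnorm(50)                           # simulated response

# Hypothetical split standing in for curtrainfoldi / curtestfoldi
test_idx  <- 1:10
train_idx <- setdiff(seq_len(nrow(X)), test_idx)

# Convert once so the train and test rows carry identical column names
X_df <- as.data.frame(X)

cv_glmnet <- train(x = X_df[train_idx, ], y = y[train_idx],
                   method = "glmnet",
                   trControl = trainControl(method = "cv", number = 5),
                   tuneLength = 3)

# newdata has the same column names as the training data
yhat <- predict(cv_glmnet, newdata = X_df[test_idx, ])
```

With purely random data, train() may still warn about missing values in the resampled performance measures. That warning is separate from the prediction error.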
Upvotes: 2