Reputation: 174
I am trying to fit a lasso regression with a cross-validated lambda using the glmnet and caret packages. My code is:
library(glmnet)
library(doParallel)

dim(x)
# 121755 465
dim(y)
# 121755 1

### cv.glmnet
set.seed(2108)
cl <- makePSOCKcluster(detectCores() - 2, outfile = "")
registerDoParallel(cl)
system.time(
  las.glm <- cv.glmnet(x = x, y = y, alpha = 1, type.measure = "mse",
                       parallel = TRUE, nfolds = 5,
                       lambda = seq(0.001, 0.1, by = 0.001),
                       standardize = FALSE)
)
stopCluster(cl)
# user system elapsed
# 17.98 2.28 37.23
library(caret)

### caret
caretctrl <- trainControl(method = "cv", number = 5)
tune <- expand.grid(alpha = 1, lambda = seq(0.001, 0.1, by = 0.001))
set.seed(2108)
cl <- makePSOCKcluster(detectCores() - 2, outfile = "")
registerDoParallel(cl)
system.time(
  las.car <- train(x = x, y = as.numeric(y), alpha = 1, method = "glmnet",
                   metric = "RMSE", allowParallel = TRUE,
                   trControl = caretctrl, tuneGrid = tune)
)
stopCluster(cl)
# error
Something is wrong; all the RMSE metric values are missing:
RMSE Rsquared MAE
Min. : NA Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA Median : NA
Mean :NaN Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA Max. : NA
NA's :100 NA's :100 NA's :100
Error: Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
Timing stopped at: 3.97 1.37 127.9
I understand that this can happen when one of the resamples does not contain enough data, but I doubt that is an issue given my sample size and only 5 folds. I have tried the following suggested solutions, which didn't work for me:

- setting allowParallel when the CPU is not multithreaded

I reckon that caret is performing some additional resampling that glmnet is not, leading to the error. Can someone shed any light on this problem?
Edit 1
x is a semi-sparse matrix of 210 indicator and 255 continuous variables.
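For reference, a matrix with that shape can be simulated as follows (the row count, sparsity level, and variable names here are illustrative, not my actual data):

```r
library(Matrix)
set.seed(2108)

n <- 1000  # illustrative; the real data has 121755 rows
x_ind  <- Matrix(rbinom(n * 210, 1, 0.05), nrow = n, sparse = TRUE)  # 210 indicator columns
x_cont <- Matrix(rnorm(n * 255), nrow = n, sparse = TRUE)            # 255 continuous columns
x <- cbind(x_ind, x_cont)  # 465-column semi-sparse design matrix
```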
Upvotes: 1
Views: 910
Reputation: 46978
I think most of the problem comes from passing alpha=1 to train again, on top of specifying it in the tuning grid. Here is a reproducible example; it works even if your x and y are sparse:
library(glmnet)
library(caret)
library(Matrix)

# a sparse version of the data also works:
dat <- Matrix(as.matrix(mtcars), sparse = TRUE)

x <- as.matrix(mtcars[, -1])
y <- as.matrix(mtcars[, 1])
L <- seq(0.001, 0.1, by = 0.02)

las.glm <- cv.glmnet(x = x, y = y, alpha = 1, type.measure = "mse",
                     nfolds = 5, lambda = L, standardize = FALSE)
So cv.glmnet works. Now, if we try your code, it reproduces the error:
caretctrl <- trainControl(method = "cv", number = 5)
tune <- expand.grid(alpha = 1, lambda = L)

las.car <- train(x = x, y = as.numeric(y), alpha = 1, method = "glmnet",
                 metric = "RMSE", trControl = caretctrl, tuneGrid = tune)
Something is wrong; all the RMSE metric values are missing:
RMSE Rsquared MAE
Min. : NA Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
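My understanding of why this happens (a sketch of the mechanism, not caret's actual internals): train forwards extra arguments through `...` to the underlying glmnet call, while the tuning grid supplies alpha as well, so inside each fold glmnet receives alpha twice and fails, leaving every resampled metric NA. The duplicate-argument failure can be mimicked with any R function:

```r
# illustrative only: calling a function with the same named argument twice
g <- function(x, y, alpha) NULL
# g(1, 2, alpha = 1, alpha = 1)
# Error: formal argument "alpha" matched by multiple actual arguments
```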
Remove the alpha argument from the train call, keeping it only in tuneGrid:
las.car <- train(x = x, y = as.numeric(y), method = "glmnet",
                 metric = "RMSE", trControl = caretctrl, tuneGrid = tune)
glmnet
32 samples
10 predictors
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 25, 26, 26, 26, 25
Resampling results across tuning parameters:
lambda RMSE Rsquared MAE
0.001 3.798431 0.7689346 3.003005
0.021 3.360426 0.7821630 2.714694
0.041 3.099981 0.7958414 2.543577
0.061 2.842374 0.8066351 2.328833
0.081 2.801421 0.8046289 2.301098
And it will also work with a dense matrix.
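As a quick sanity check (assuming the las.glm and las.car objects from above are still in the session), you can compare the lambda each approach selects; the two values may differ slightly because the fold assignments differ:

```r
las.glm$lambda.min        # lambda minimizing CV MSE in cv.glmnet
las.car$bestTune$lambda   # lambda selected by caret's resampling
```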
Upvotes: 1