I tried to fit regularized regression models (LASSO, ridge, elastic net) with a leave-one-out cross-validation (LOOCV) split, using cv.glmnet and glmnet.
I used this dataset (the UCI forest fires data); it has 517 rows and 13 columns, two of which are categorical variables. The dependent variable is "area".
I would like to build a model without the categorical variables, and then calculate the mean of the coefficients over the LOOCV runs, as well as the mean R-squared and root mean squared error.
The last 10 rows of the dataset are the following:
tail(data, 10)
X Y month day FFMC DMC DC ISI temp RH wind rain area
508 2 4 aug fri 91.0 166.9 752.6 7.1 25.9 41 3.6 0.0 0.00
509 1 2 aug fri 91.0 166.9 752.6 7.1 25.9 41 3.6 0.0 0.00
510 5 4 aug fri 91.0 166.9 752.6 7.1 21.1 71 7.6 1.4 2.17
511 6 5 aug fri 91.0 166.9 752.6 7.1 18.2 62 5.4 0.0 0.43
512 8 6 aug sun 81.6 56.7 665.6 1.9 27.8 35 2.7 0.0 0.00
513 4 3 aug sun 81.6 56.7 665.6 1.9 27.8 32 2.7 0.0 6.44
514 2 4 aug sun 81.6 56.7 665.6 1.9 21.9 71 5.8 0.0 54.29
515 7 4 aug sun 81.6 56.7 665.6 1.9 21.2 70 6.7 0.0 11.16
516 1 4 aug sat 94.4 146.0 614.7 11.3 25.6 42 4.0 0.0 0.00
517 6 3 nov tue 79.5 3.0 106.7 1.1 11.8 31 4.5 0.0 0.00
My attempted code is as follows:
library(glmnet)  # cv.glmnet, glmnet
library(caret)   # RMSE(), R2()
library(dplyr)   # %>%

set.seed(123)
data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv")
data <- data[-(3:4)]  # drop the categorical variables month and day
nrFolds <- 517
folds <- rep_len(1:nrFolds, nrow(data))
for (k in 1:nrFolds) {
  fold <- which(folds == k)
  data.train <- data[-fold, ]
  data.test  <- data[fold, ]
  x.train <- as.matrix(data.train[-11])  # "area" is column 11 after the drop
  y.train <- as.matrix(data.train[11])
  x.test  <- as.matrix(data.test[-11])
  y.test  <- as.matrix(data.test[11])
  cv <- cv.glmnet(x.train, y.train, alpha = 0)
  # cv$lambda.min
  model <- glmnet(x.train, y.train, alpha = 0, lambda = cv$lambda.min)
  coef(model)
  predictions <- model %>% predict(x.test) %>% as.vector()
  RMSE <- RMSE(predictions, data.test$area)
  Rsquare <- R2(predictions, data.test$area)
  LL <- list(cv, model, coef(model), predictions, RMSE, Rsquare)
}
LL
The code gives me only one prediction value, and R-squared is NA. Also, I am not sure whether using a leave-one-out cross-validation split with cv.glmnet and glmnet is right at all.
Any help is appreciated.
Updated
I tried to use the caret package. I first split the dataset with LOOCV because I want every single observation to take part in the test process. Then I used the train function to carry out my initial idea from the question.
My code:
data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv")
data <- data[-(3:4)]
lambda <- 10^seq(-3, 3, length = 100)
predictions <- rep(0, nrow(data))
set.seed(123)
for (i in 1:nrow(data)) {
  test.data  <- data[i, ]
  train.data <- data[-i, ]
  ridge <- train(
    area ~ ., data = train.data, method = "glmnet",
    trControl = trainControl("loocv", number = 517),
    tuneGrid = expand.grid(alpha = 0, lambda = lambda))
  coefs <- coef(ridge$finalModel, ridge$bestTune$lambda)
  predictions[i] <- ridge %>% predict(test.data)
  RMSE <- RMSE(predictions, test.data$area)
  Rsquare <- R2(predictions, test.data$area)
  LL <- list(RMSE, Rsquare, coefs, predictions)
}
This code also gives me the following error:
Error in cor(obs, pred, use = ifelse(na.rm, "complete.obs", "everything")) :
incompatible dimensions
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
Further update
I also wrote the following code using the caret package (ridge, for example):
set.seed(123)
data <- read.csv("forestfires.csv")
data <- data[-(3:4)]
lambda.grid <- 10^seq(10, -2, length = 100)
alpha.grid <- 0  # ridge
control <- trainControl(method = "LOOCV")
srchGrd <- expand.grid(alpha = alpha.grid, lambda = lambda.grid)
lm_model <- train(area ~ ., data = data, trControl = control, tuneGrid = srchGrd, method = "glmnet")
lm_model
coef(lm_model$finalModel, lm_model$bestTune$lambda)
Am I right now?
OK, so now that you have stated it, the problem is quite obvious. Each time through the loop you assign a new value to LL, so LL doesn't hold all the results, just the last one. Try:
LL <- lapply(1:nrFolds, function(k) {
  fold <- which(folds == k)
  data.train <- data[-fold, ]
  data.test  <- data[fold, ]
  x.train <- as.matrix(data.train[-11])
  y.train <- as.matrix(data.train[11])
  x.test  <- as.matrix(data.test[-11])
  y.test  <- as.matrix(data.test[11])
  cv <- cv.glmnet(x.train, y.train, alpha = 0)
  # cv$lambda.min
  model <- glmnet(x.train, y.train, alpha = 0, lambda = cv$lambda.min)
  predictions <- model %>% predict(x.test) %>% as.vector()
  RMSE <- RMSE(predictions, data.test$area)
  Rsquare <- R2(predictions, data.test$area)
  list(CV = cv, model = model, coefs = coef(model), preds = predictions, rmse = RMSE, r2 = Rsquare)
})
However, please look into the caret package: it automates the creation of folds and the testing. Also, I would calculate the RMSE and R² post hoc.
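For example, a minimal sketch of the fold helper (createFolds returns lists of indices; the k = 10 here is just an illustrative choice, not something from the question):

library(caret)
# returnTrain = TRUE gives the training-set indices for each fold
folds <- createFolds(data$area, k = 10, returnTrain = TRUE)
str(folds, max.level = 1)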
EDIT: Yes, R2 is NA. This is because the above code attempts leave-one-out CV: rep_len as used here returns the same as 1:517, so fold is just a single number between 1 and 517, each data.train has 516 rows, and data.test is a single row. Therefore, predictions is a vector of length 1 and R2 cannot be calculated. The predictions themselves, however, are returned correctly:
preds <- sapply(LL, function(x) x$preds)
Not that they are very good, but the question was about running glmnet, not about how to get the best predictions.
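With all 517 held-out predictions in hand, the post-hoc metrics mentioned above, and the coefficient mean the question asked for, fall out directly (a sketch, assuming the LL and preds objects from above are still in scope):

# post-hoc metrics over the full vector of held-out predictions
sqrt(mean((preds - data$area)^2))  # overall RMSE
cor(preds, data$area)^2            # overall R-squared

# mean coefficient across the LOOCV fits (the question's original goal)
coef.mat <- sapply(LL, function(x) as.matrix(x$coefs))  # one column per fold
rownames(coef.mat) <- rownames(LL[[1]]$coefs)
rowMeans(coef.mat)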
EDIT 2: regarding your updated code, here is what happens. You first initialize the vector predictions with 517 zeroes. In the first iteration of your loop, you tell caret to run LOOCV on a training set that contains 516 samples. caret then returns the optimized ridge model (plus a lot of information which you ignore, such as the RMSE for each parameter value tested). Then you make predictions on your test set, which is one sample. You enter this single sample into the predictions vector, which now contains 1 prediction and 516 zeroes. Then you try to calculate R2 and RMSE between predictions (a vector of one prediction and 516 zeroes) and the test response (a vector of one value), which, not unexpectedly, fails. Finally, you store it all in a list called LL, which gets overwritten the next time the loop runs.
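A minimal repair of that loop, for illustration: collect the predictions first and compute the metrics once, after the loop, so both vectors have length 517. (I substitute a 10-fold inner CV for the nested LOOCV purely to keep the run time sane; that choice is mine, not from the question.)

library(caret)
set.seed(123)
lambda <- 10^seq(-3, 3, length = 100)
predictions <- numeric(nrow(data))
for (i in seq_len(nrow(data))) {
  # tune ridge on all rows except i, then predict the held-out row
  fit <- train(area ~ ., data = data[-i, ], method = "glmnet",
               trControl = trainControl(method = "cv", number = 10),
               tuneGrid = expand.grid(alpha = 0, lambda = lambda))
  predictions[i] <- predict(fit, newdata = data[i, ])
}
RMSE(predictions, data$area)  # both vectors now have length 517
R2(predictions, data$area)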
How I would do it: set aside 33% of the data as a validation set. Use the remaining 67% with caret to train a single model (with LOOCV or k-fold CV in caret to optimize the parameters). Inspect the output of caret; make sure you look at the RMSE which caret provides (and R², if you use something other than LOOCV). Then test the performance of your model on the validation set.
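A sketch of that workflow (createDataPartition is caret's splitter; 10-fold CV is chosen here so that caret also reports R²):

library(caret)
set.seed(123)
idx <- createDataPartition(data$area, p = 2/3, list = FALSE)
train.data <- data[idx, ]   # ~67% for training and tuning
valid.data <- data[-idx, ]  # ~33% held out for validation
fit <- train(area ~ ., data = train.data, method = "glmnet",
             trControl = trainControl(method = "cv", number = 10),
             tuneGrid = expand.grid(alpha = 0, lambda = 10^seq(-3, 3, length = 100)))
fit$results  # caret's resampled RMSE and R-squared for every lambda
valid.pred <- predict(fit, newdata = valid.data)
RMSE(valid.pred, valid.data$area)  # final check on the held-out set
R2(valid.pred, valid.data$area)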