stephr
stephr

Reputation: 113

How can I include both my categorical and numeric predictors in my elastic net model? r

As a note beforehand, I think I should mention that I am working with highly sensitive medical data that is protected by HIPAA. I cannot share real data with dput- it would be illegal to do so. That is why I made a fake dataset and explained my processes to help reproduce the error. I have been trying to estimate an elastic net model in r using glmnet. However, I keep getting an error. I am not sure what is causing it. The error happens when I go to train the data. It sounds like it has something to do with the data type and matrix. I have provided a sample dataset. Then I set the outcomes and certain predictors to be factors. After setting certain variables to be factors, I label them. Next, I create an object with the column names of the predictors I want to use. That object is pred.names.min. Then I partition the data into the training and test data frames. 65% in the training, 35% in the test. With the train control function, I specify a few things I want to have happen with the model- random paraments for lambda and alpha, as well as the leave one out method. I also specify that it is a classification model (categorical outcome). In the last step, I specify the training model. I write my code to tell it to use all of the predictor variables in the pred.names.min object for the trainingset data frame.

library(dplyr)
library(tidyverse)
library(glmnet),0,1,0
library(caret)

#creating sample dataset
df<-data.frame("BMIfactor"=c(1,2,3,2,3,1,2,1,3,2,1,3,1,1,3,2,3,2,1,2,1,3),
"age"=c(0,4,8,1,2,7,4,9,9,2,2,1,8,6,1,2,9,2,2,9,2,1),  
"L_TartaricacidArea"=c(0,1,1,0,1,1,1,0,0,1,0,1,1,0,1,0,0,1,1,0,1,1),
"Hydroxymethyl_5_furancarboxylicacidArea_2"= 
c(1,1,0,1,0,0,1,0,1,1,0,1,1,0,1,1,0,1,0,1,0,1),
"Anhydro_1.5_D_glucitolArea"=
c(8,5,8,6,2,9,2,8,9,4,2,0,4,8,1,2,7,4,9,9,2,2),
"LevoglucosanArea"= 
c(6,2,9,2,8,6,1,8,2,1,2,8,5,8,6,2,9,2,8,9,4,2),
"HexadecanolArea_1"=
c(4,9,2,1,2,9,2,1,6,1,2,6,2,9,2,8,6,1,8,2,1,2),
"EthanolamineArea"=
c(6,4,9,2,1,2,4,6,1,8,2,4,9,2,1,2,9,2,1,6,1,2),
"OxoglutaricacidArea_2"=
c(4,7,8,2,5,2,7,6,9,2,4,6,4,9,2,1,2,4,6,1,8,2),
"AminopentanedioicacidArea_3"=
c(2,5,5,5,2,9,7,5,9,4,4,4,7,8,2,5,2,7,6,9,2,4),
"XylitolArea"=
c(6,8,3,5,1,9,9,6,6,3,7,2,5,5,5,2,9,7,5,9,4,4),
"DL_XyloseArea"=
c(6,9,5,7,2,7,0,1,6,6,3,6,8,3,5,1,9,9,6,6,3,7),
"ErythritolArea"=
c(6,7,4,7,9,2,5,5,8,9,1,6,9,5,7,2,7,0,1,6,6,3),
"hpresponse1"=
c(1,0,1,1,0,1,1,0,0,1,0,0,1,0,1,1,1,0,1,0,0,1),
"hpresponse2"=
c(1,0,1,0,0,1,1,1,0,1,0,1,0,1,1,0,1,0,1,0,0,1))

#setting variables as factors
df$hpresponse1<-as.factor(df$hpresponse1)
df$hpresponse2<-as.factor(df$hpresponse2)
df$BMIfactor<-as.factor(df$BMIfactor)
df$L_TartaricacidArea<- as.factor(df$L_TartaricacidArea)
df$Hydroxymethyl_5_furancarboxylicacidArea_2<- 
as.factor(df$Hydroxymethyl_5_furancarboxylicacidArea_2)

#labeling factor levels 
df$hpresponse1 <- factor(df$hpresponse1, labels = c("group1.2", "group3.4"))
df$hpresponse2 <- factor(df$hpresponse2, labels = c("group1.2.3", "group4"))
df$L_TartaricacidArea <- factor(df$L_TartaricacidArea, labels =c ("No", 
"Yes"))
df$Hydroxymethyl_5_furancarboxylicacidArea_2 <- 
factor(df$Hydroxymethyl_5_furancarboxylicacidArea_2, labels =c ("No", 
"Yes"))
df$BMIfactor <- factor(df$BMIfactor, labels = c("<40", ">=40and<50", 
">=50"))

#creating list of predictor names
pred.start.min <- which(colnames(df) == "BMIfactor"); pred.start.min
pred.stop.min <- which(colnames(df) == "ErythritolArea"); pred.stop.min
pred.names.min <- colnames(df)[pred.start.min:pred.stop.min]

#partition data into training and test (65%/35%)
set.seed(2)
n=floor(nrow(df)*0.65)
train_ind=sample(seq_len(nrow(df)), size = n)
trainingset=df[train_ind,]
testingset=df[-train_ind,]

#specifying that I want to use the leave one out cross- 
#validation method and 
use "random" as search for elasticnet
tcontrol <- trainControl(method = "LOOCV",
                         search="random",
                         classProbs = TRUE)


#training model
elastic_model1 <- train(as.matrix(trainingset[, 
pred.names.min]), 
trainingset$hpresponse1,
                        data = trainingset,
                        method = "glmnet",
                        trControl = tcontrol)

After I run the last chunk of code, I end up with this error:

Error in { : 
task 1 failed - "error in evaluating the argument 'x' in selecting a 
method for function 'as.matrix': object of invalid type "character" in 
'matrix_as_dense()'"
In addition: There were 50 or more warnings (use warnings() to see the first 
50)

I tried removing the "as.matrix" arguemtent:

elastic_model1 <- train((trainingset[, pred.names.min]), 
trainingset$hpresponse1,
                        data = trainingset,
                        method = "glmnet",
                        trControl = tcontrol)

It still produces a similar error.

Error in { : 
task 1 failed - "error in evaluating the argument 'x' in selecting a method 
for function 'as.matrix': object of invalid type "character" in 
'matrix_as_dense()'"
In addition: There were 50 or more warnings (use warnings() to see the first 
50)

When I tried to make none of the predictors factors (but keep outcome as factor), this is the error I get:

Error: At least one of the class levels is not a valid R variable name; This 
will cause errors when class probabilities are generated because the 
variables names will be converted to  X0, X1 . Please use factor levels that 
can be used as valid R variable names  (see ?make.names for help).

How can I fix this? How can I use my predictors (both the numeric and categorical ones) without producing an error?

Upvotes: 1

Views: 1322

Answers (2)

vesper
vesper

Reputation: 1

I had an identical error message when using glmnet while creating my x and y matrices manually (the values were already calculated not using the package they said would be streamlined into using glmnet functions).

I found out that at some point while using commands to convert my dataframe to array/matrix using something like:

x <- as.matrix.data.frame(my_x_dataframe)

The data was forced into all characters.

To get rid of the error, after converting it as written above, I used:

my_x_dataframe = apply(my_x_dataframe, 2, FUN = function(y){as.numeric(y)}) 

to convert it back to numeric. (I removed all my factor variables as reading through people's notes that glmnet cannot really handle them).

After this, I did it for my y dataset as well (I was running cox, so I had x and y datasets), and the error message disappeared.

Upvotes: 0

r_user
r_user

Reputation: 374

glmnet does not handle factors well. The recommendation currently is to dummy code and re-code to numeric where possible: Using LASSO in R with categorical variables

Upvotes: 0

Related Questions