Reputation: 91
I have a dataset with 283 observations of 60 variables. My outcome variable (Diagnosis) is dichotomous and can be either of two diseases. The two diseases often show considerable overlap, and I am trying to find the features that help differentiate them. I understand that LASSO logistic regression is well suited to this problem; however, it cannot be run on an incomplete dataset.
So I imputed my missing data with the mice package in R and found that approximately 40 imputations are appropriate for the amount of missing data I have.
Now I want to perform LASSO logistic regression on all 40 imputed datasets, but I am stuck at the part where I need to pool the results across them.
The with() function from mice does not work with glmnet.
library(mice)
# Impute the dataset with missing values using the mice package:
imp <- mice(WMT1, m = 40)
# Fit a regular logistic regression on each imputed dataset:
imp.fit <- glm.mids(Diagnosis ~ ., data = imp,
                    family = binomial)
# Pool the results of all 40 imputed datasets:
summary(pool(imp.fit), 2)
The above seems to work fine for logistic regression with glm(), but the same approach fails when I try it with LASSO regression:
# First perform cross-validation to find the optimal lambda value:
CV <- cv.glmnet(Diagnosis ~ ., data = imp,
                family = "binomial", alpha = 1, nlambda = 100)
When I try to perform cross-validation, I get this error message:
Error in as.data.frame.default(data) :
cannot coerce class ‘"mids"’ to a data.frame
Can somebody help me with this problem?
Upvotes: 1
Views: 1549
Reputation: 34
A thought: consider running the analysis on each of the 40 imputed datasets separately, storing which variables are selected in each one in a matrix, and then setting a threshold for keeping a variable (e.g., selected in >50% of the datasets).
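A rough sketch of that idea, in case it helps. cv.glmnet() from glmnet expects a numeric predictor matrix and a response vector, so each completed dataset has to be extracted with mice::complete() and converted first. The helper name lasso_selection, the use of lambda.1se, and the 50% cutoff are my own assumptions for illustration, not an established pooling rule. Since glmnet does not report standard errors, pool() cannot be applied to these fits directly, which is why this only counts how often each variable is selected.

```r
library(mice)
library(glmnet)

# Hypothetical helper (not part of mice or glmnet): fit a cross-validated
# lasso on every completed dataset and record which predictors end up
# with a non-zero coefficient.
lasso_selection <- function(imp, outcome = "Diagnosis", s = "lambda.1se") {
  selected <- lapply(seq_len(imp$m), function(i) {
    dat <- mice::complete(imp, i)            # i-th completed dataset
    # glmnet needs a numeric matrix; model.matrix() expands factors to
    # dummy columns, and [, -1] drops the intercept column
    f <- reformulate(".", response = outcome)
    x <- model.matrix(f, data = dat)[, -1, drop = FALSE]
    y <- as.factor(dat[[outcome]])
    cv <- cv.glmnet(x, y, family = "binomial", alpha = 1, nlambda = 100)
    b <- as.matrix(coef(cv, s = s))[-1, 1]   # coefficients without intercept
    b != 0                                   # TRUE where the variable is kept
  })
  sel <- do.call(rbind, selected)            # rows: imputations, cols: predictors
  rownames(sel) <- paste0("imp", seq_len(imp$m))
  sel
}

# Usage, assuming `imp` is the mids object from the question:
# sel  <- lasso_selection(imp)
# freq <- colMeans(sel)          # share of imputations selecting each variable
# names(freq)[freq > 0.5]        # variables selected in >50% of datasets
```

Note that the cross-validation folds are random, so the selection frequencies will vary a little between runs unless you call set.seed() first. Factor predictors show up as one column per dummy level, so you may want to decide how to treat a factor when only some of its levels are selected.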
Upvotes: 0