Reputation: 31
I've called a multiple logistic regression as follows:
step_1 <- glm(CD3LR ~ alb + surg + ASA_opt + BMI + bil + Hb_cat + MDRD + sex + DM
+ age + Path + Smoking,
na.action = na.exclude, family = binomial)
When I want to test the model by creating a ROC curve, I use the pROC package and call:
roc(CD3LR, step_1$fitted.values, plot=FALSE)
However this returns the error:
> roc(CD3LR, step_1fitted.values, plot=FALSE)
Setting levels:control=0,case=1
Error in roc.default(CD3LR,step_1fitted.values, plot = FALSE) :
Response and predictor must be vectors of the same length.
I think this is because there are 3 missing values for the MDRD
variable and because I've used na.exclude
this results in 3 fewer step_1$fitted.values
than I have for CD3LR
Is there a way to generate a ROC curve using only the CD3LR
values that correspond to the step_1$fitted.values
??
Very grateful for any help!
Upvotes: 3
Views: 4946
Reputation: 46968
I notice you have all your variables in the environment, for example like this, and below I introduce 3NA for sex and 2 NA for BMI,
CD3LR = as.numeric(runif(100)>0.5)
alb = rnorm(100)
surg = sample(1:3,100,replace=TRUE)
ASA_opt = rpois(100,50)
BMI = c(NA,NA,rpois(98,100))
bil = rnorm(100)
Hb_cat = sample(1:3,100,replace=TRUE)
MDRD = runif(100)
sex = c(sample(c("M","F"),98,replace=TRUE),NA,NA)
DM = rnorm(100)
age = sample(20:60,100,replace=TRUE)
Path = rnorm(100)
Smoking = sample(c("Yes","NI"),100,replace=TRUE)
So best is to put all of them into a data.frame, do the fit and then the roc curve:
DataFrame = data.frame(
CD3LR,alb,surg,ASA_opt,BMI,bil,Hb_cat,
MDRD,sex,DM,age,Path,Smoking)
step_1<-glm(CD3LR~alb+surg+ASA_opt+BMI+bil+Hb_cat+MDRD+sex+DM+age+Path+Smoking,
data=DataFrame,na.action=na.exclude,family=binomial)
Thanks to @Calimo for pointing it out, you can also fit with:
step_1<-glm(CD3LR~.,
data=DataFrame,na.action=na.exclude,family=binomial)
This gives you the error:
roc(DataFrame$CD3LR,step_1$fitted.values,plot=FALSE)
We can do:
roc(step_1$y,step_1$fitted.values,plot=FALSE)
Or:
roc(DataFrame[complete.cases(DataFrame),"CD3LR"],step_1$fitted.values,plot=FALSE)
The reason for having a data frame is that you can immediately see the complete set of data which is non NA for all. If you have any NAs in the predictor or response, it will not be used in the regression.
Upvotes: 1
Reputation: 7969
Your intuition is correct, some values have been filtered out by na.action=na.exclude
.
Typically I would recommend using the predict
function to obtain new predictions on the data. They look like this:
> predict(step_1)
1 2 3 4 5 6 7 ...
NA NA 1.04059269 0.60248768 0.81502210 0.23992288 0.08421514 ...
As you can see, the missing values in MDRD produce an NA
, rather than being removed.
You can then feed these predictions directly to the roc function:
roc(CD3LR, predict(step_1))
Upvotes: 2