ROC for Logistic regression in R

I would like to ask for help with my project. My goal is to get a ROC curve from an existing logistic regression model.

First of all, here is what I'm analyzing.

glm.fit <- glm(Severity_Binary ~ Side + State + Timezone + Temperature.F. + Wind_Chill.F. + Humidity... + Pressure.in. + Visibility.mi. + Wind_Direction + Wind_Speed.mph. + Precipitation.in. + Amenity + Bump + Crossing + Give_Way + Junction + No_Exit + Railway + Station + Stop + Traffic_Calming + Traffic_Signal + Sunrise_Sunset , data = train_data, family = binomial)

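# Predicted probabilities on the training data (overwritten on the next line)
glm.probs <- predict(glm.fit, type = "response")

# Predicted probabilities on the held-out test data
glm.probs <- predict(glm.fit, newdata = test_data, type = "response")
# Classify at a 0.5 cutoff
glm.pred <- ifelse(glm.probs > 0.5, "1", "0")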

This part works fine: I am able to show a table of predictions and the mean result. But here is my problem. I'm using the pROC library, though I am open to using anything else you can suggest. test_data has approximately 975 rows, but the proc object contains only 3 sensitivity/specificity values.

library(pROC)
proc <- roc(test_data$Severity_Binary,glm.probs) 

test_data$sens <- proc$sensitivities[1:975] 
test_data$spec <- proc$specificities[1:975]

ggplot(test_data, aes(x=spec, y=sens)) + geom_line()

Here's what I get as a result:

[image: resulting plot]

With the warning message:

Removed 972 row(s) containing missing values (geom_path).

As I found out, proc has only 3 values, as I said.

[image: screenshot of the proc object]

Upvotes: 2

Answers (2)

pholzm

Reputation: 1784

If you consider what the ROC curve does, there is no reason to expect it to have the same dimensions as your dataframe. It provides summary statistics of your model performance (sensitivity, specificity) evaluated on your dataset for different thresholds in your prediction.
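To make the "summary statistics" point concrete, here is a minimal base-R sketch that computes sensitivity and specificity by hand at a few thresholds. The data are simulated and the names `truth`, `scores`, and `sens_spec` are made up for illustration; they are not the question's data or part of pROC.

```r
# Simulated labels and scores, purely for illustration (not the question's data)
set.seed(1)
n      <- 200
truth  <- rbinom(n, 1, 0.4)
scores <- plogis(rnorm(n, mean = ifelse(truth == 1, 1, -1)))

# Sensitivity and specificity of the rule "predict 1 if score > threshold"
sens_spec <- function(threshold) {
  pred <- as.integer(scores > threshold)
  c(threshold   = threshold,
    sensitivity = sum(pred == 1 & truth == 1) / sum(truth == 1),
    specificity = sum(pred == 0 & truth == 0) / sum(truth == 0))
}

# One row per threshold, regardless of how many observations there are
thresholds <- c(-Inf, 0.25, 0.5, 0.75, Inf)
roc_points <- as.data.frame(t(sapply(thresholds, sens_spec)))
print(roc_points)
```

The result has 5 rows because 5 thresholds were evaluated, even though the data have 200 observations. That is why the ROC output cannot be lined up row by row with the original data frame.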

Usually you would expect more nuance on the curve (more than the 3 data points at thresholds -Inf, 0.5, Inf). Look at the distribution of your glm.probs: this ROC curve indicates that all predictions are either 0 or 1, with very little in between (hence only one threshold at 0.5 on your curve). [This could also mean that you unintentionally used your binary glm.pred for calculating the ROC curve, and not glm.probs as shown in the question (?)]
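One way to run that check is sketched below. `probs` and `pred` are simulated stand-ins for glm.probs and glm.pred (the real values are not available here), so only the pattern of the check carries over, not the numbers.

```r
# Hypothetical stand-ins for glm.probs and glm.pred from the question
set.seed(42)
probs <- runif(975)
pred  <- ifelse(probs > 0.5, "1", "0")

# Spread of the predicted probabilities
summary(probs)

# Number of distinct values in each candidate predictor
n_unique_probs <- length(unique(probs))
n_unique_pred  <- length(unique(pred))
print(c(probs = n_unique_probs, pred = n_unique_pred))
```

A predictor with k distinct values can produce at most k + 1 points on the ROC curve, so a predictor with only two distinct values gives exactly the 3 points seen in the question. If the real glm.probs has many distinct values, the binary predictions were probably passed to roc() by mistake.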

This seems to be more an issue with your model than with your code. Here is an example from a random different dataset, using the same steps you took (glm(..., family = binomial), then predict(..., type = "response")). This produces a ROC curve with 333 steps for ~1300 data points.

PS: Ignore the fact that this is evaluated on training data; the point is that the code looks alright up to the point of generating the ROC curve.

library(pROC)

# dftitanic is the answerer's Titanic data frame (not defined in this answer)
m1 <- glm(survived ~ passengerClass + sex + age, data = dftitanic, family = binomial)
myroc <- roc(dftitanic$survived, predict(m1, dftitanic, type = "response"))

plot(myroc)

Upvotes: 0

Calimo

Reputation: 7969

You can't (and shouldn't) assign the sensitivity and specificity to the data. They are summary data and exist in a different dimension than your data.

Specifically, these two lines are wrong and make no sense at all:

test_data$sens <- proc$sensitivities[1:975] 
test_data$spec <- proc$specificities[1:975]

Instead, you must either save them to a new data.frame or use one of the existing functions, such as ggroc:

ggroc(proc)
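For the first option, a rough sketch of saving the values to a new data.frame and plotting them with ggplot2 might look like this. The data are simulated, and `truth`, `probs`, `roc_obj`, and `roc_df` are illustrative names standing in for test_data$Severity_Binary, glm.probs, and proc.

```r
library(pROC)
library(ggplot2)

# Simulated stand-ins for test_data$Severity_Binary and glm.probs
set.seed(1)
truth <- rbinom(975, 1, 0.5)
probs <- plogis(rnorm(975, mean = ifelse(truth == 1, 0.5, -0.5)))

# Setting levels and direction explicitly avoids pROC guessing them
roc_obj <- roc(truth, probs, levels = c(0, 1), direction = "<")

# The ROC points get their own data frame, one row per threshold
roc_df <- data.frame(
  threshold   = roc_obj$thresholds,
  sensitivity = roc_obj$sensitivities,
  specificity = roc_obj$specificities
)

# geom_path keeps the points in threshold order instead of sorting by x
p <- ggplot(roc_df, aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_abline(linetype = "dashed")
print(p)
```

The number of rows in `roc_df` follows the number of thresholds, not the 975 rows of the data, which is why the values cannot be assigned back into test_data.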

Upvotes: 2
