Mo.ms
Mo.ms

Reputation: 43

how to Plot the results of a logistic regression model using base R and ggplot

**creat a new data frame and add a binary column called surv24** 

leukemia.data <- data.frame(wbc = leuk$wbc, ag = leuk$ag, time = leuk$time, surv24 =ifelse(leuk$time>=24, 1,0)) 


    Wbc     ag      time    surv24
1   2300    present  65      1
2   750     present  156     1
3   4300    present  100     1
4   2600    present  134     1
5   6000    present  16      0
6   10500   present  108     1
7    4000   absent   17      0




**logistic regression model**

logistic.model <- glm(surv24 ~ log.wbc + ag, family='binomial', data=leukemia.data)
summary(logistic.model)

fit <- predict(logistic.model, type='response')

leuki <- data.frame (cbind(leukemia.data, fit))

ag_present <- subset(leuki[leuki$ag=='present',])
ag_absent <- subset(leuki[leuki$ag=='absent',])


plot(surv24 ~ wbc, data = leukemia.data, main = "Survival Probablity vs Number of White blood cells",  xlab = "Number of White blood cells", ylab = "Surviavla Probablity")
lines(ag_present$wbc, ag_present$fit, col='red')
lines(ag_absent$wbc, ag_absent$fit, col='green')
legend(0.8,85000, legend =c("Simple Linear Regression Model Predictions","Quadratic Regression Model Predictions"), col = c("green","red"), lty = 1:2, cex=0.7)

This the output I'm getting for the plot enter image description here

This is an example of the graph I want to construct
enter image description here

Note that I want to use both (base R) and (ggplot) to construct this graph

Upvotes: 2

Views: 3685

Answers (1)

Allan Cameron
Allan Cameron

Reputation: 173793

It's not really possible to give a working example as an answer without a bit more sample data than this. I'll therefore attempt to reverse-engineer your data set:

set.seed(123)

wbc     <- rnorm(100, 9)
logodds <- c(8 - wbc[1:50], 10 - wbc[51:100])
probs   <- exp(logodds)/(1 + exp(logodds))
surv24  <- rbinom(100, 1, probs)
ag      <- rep(c("absent", "present"), each = 50)
wbc     <- 50 * round(wbc * 20)
leukemia.data <- data.frame(wbc, ag, time = sample(200), surv24)[sample(100),]

head(leukemia.data)
#>     wbc      ag time surv24
#> 24 8250  absent  133      1
#> 39 8700  absent   63      0
#> 47 8600  absent   76      0
#> 60 9200 present   23      0
#> 63 8650 present   39      1
#> 37 9550  absent  104      0

This looks fairly close. Now we will create your model. It is not clear why you are taking the log of the white cell count: this distorts your final result and isn't necessary, so we will run the model just with white cell count:

logistic.model <- glm(surv24 ~ wbc + ag, family = 'binomial', data = leukemia.data)

summary(logistic.model)
#> 
#> Call:
#> glm(formula = surv24 ~ wbc + ag, family = "binomial", data = leukemia.data)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.0564  -0.8820   0.4460   0.8258   1.9888  
#> 
#> Coefficients:
#>               Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  5.7107385  2.4790234   2.304  0.02124 *  
#> wbc         -0.0007356  0.0002790  -2.636  0.00839 ** 
#> agpresent    2.1593304  0.4922130   4.387 1.15e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 138.47  on 99  degrees of freedom
#> Residual deviance: 110.56  on 97  degrees of freedom
#> AIC: 116.56
#> 
#> Number of Fisher Scoring iterations: 4

To generate nice logistic lines, we need to create a dummy dataset containing samples along the range of the x values we wish to plot:

dummy_df <- data.frame(ag = rep(c("present", "absent"), each = 151),
                       wbc = rep(seq(0, 15000, 100), 2))

dummy_df$surv24 <- predict(logistic.model, newdata = dummy_df, type = "response")

Now we can plot. It's actually far simpler to do this with ggplot:

library(ggplot2)

ggplot(leukemia.data, aes(wbc, surv24, color = ag)) + 
  geom_point() +
  geom_line(data = dummy_df) +
  lims(x = c(0, 15000))

However, to recreate your target plot in base R graphics, you could do something like:

plot(dummy_df$wbc[1:151]/1000, dummy_df$surv24[1:151], 
     type = "l", col = "green", ylim = c(0, 1),
     ylab = "Probability of death prior to 24 weeks",
     xlab = "WBC counts", bty = "L")
lines(dummy_df$wbc[152:302]/1000, dummy_df$surv24[152:302], col = "black")
with(leukemia.data, points(wbc[ag == "present"]/1000, 
                           surv24[ag == "present"], col = "red"))
with(leukemia.data, points(wbc[ag == "absent"]/1000, 
                           surv24[ag == "absent"], col = "black"))
legend("topright", legend = c("absent", "present"), title = "AG test",
       lty = c(1, 1), col = c("black", "green"), pch = c(1, 2), bty = "L")

Now for the main caveat: since you already have the raw survival times, you should probably run this as a survival analysis, not as logistic regression, since you have lost a lot of statistical power by converting to a binary outcome.

Upvotes: 4

Related Questions