superbot

Reputation: 461

Boxplot not showing range

I have predicted values, via:

glm0 <- glm(use ~ as.factor(decision), data = decision_use, family = binomial(link = "logit"))

predicted_glm <- predict(glm0, newdata = decision_use, type = "response", se.fit = TRUE)

predict <- predicted_glm$fit
predict <- predict + 1

head(predict)
        1         2         3         4         5         6 
0.3715847 0.3095335 0.3095335 0.3095335 0.3095335 0.5000000 

Now when I plot a box plot using ggplot2,

ggplot(decision_use, aes(x = decision, y = predict)) +
  geom_boxplot(aes(fill = factor(decision)), alpha = .2)

I get a box plot with a single horizontal line per category. Looking at the predicted values, they are identical within each category, so that makes sense.

But I want a box plot that shows a range. How can I get that? When I use "use" instead of predict, I get boxes stretching from end to end (0 to 1), so I suppose that's not it either. Thank you in advance.

To clarify, predicted_glm includes se.fit values. I wonder how to incorporate those.

Upvotes: 0

Views: 541

Answers (1)

Allan Cameron

Reputation: 174393

It doesn't really make sense to do a boxplot here. A boxplot shows the range and spread of a continuous variable within groups. Your dependent variable is binary, so the values are all 0 or 1. Since you are plotting predictions for each group, your plot would have just a single point representing the expected value (i.e. the probability) for each group.
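To see why the boxes collapse to lines, here is a minimal sketch with made-up data (the data frame `toy` and its columns are hypothetical, not the asker's). With a single categorical predictor, the fitted probabilities take exactly one value per group, and that value is the group's observed proportion:

```r
# Hypothetical toy data: a binary outcome and one two-level grouping factor
set.seed(1)
toy <- data.frame(
  outcome = rbinom(100, 1, 0.4),
  group   = rep(c("A", "B"), each = 50)
)

fit <- glm(outcome ~ group, data = toy, family = binomial)
p   <- predict(fit, type = "response")  # fitted probability for every row

# Number of distinct fitted values within each group: one per group
tapply(round(p, 8), toy$group, function(v) length(unique(v)))

# Fitted probabilities next to the observed proportions per group
cbind(fitted   = tapply(p, toy$group, mean),
      observed = tapply(toy$outcome, toy$group, mean))
```

A boxplot of `p` by `group` therefore has nothing to spread over; any "range" has to come from the uncertainty of the estimate rather than from the predictions themselves.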

The closest you can come is probably to plot the prediction with 95% confidence bars around it.

You haven't provided any sample data, so I'll make some up here:

set.seed(100)
df <- data.frame(outcome = rbinom(200, 1, c(0.1, 0.9)), var1 = rep(c("A", "B"), 100))

Now we'll create our model and get the prediction for each level of my predictor variable using the newdata parameter of predict. I'm going to specify type = "link" because I want the log odds, and I'm also going to specify se.fit = TRUE so I can get the standard error of these predictions:

mod <- glm(outcome ~ var1, data = df, family = binomial)
prediction <- predict(mod, list(var1 = c("A", "B")), se.fit = TRUE, type = "link")

Now I can work out the 95% confidence intervals for my predictions:

prediction$lower <- prediction$fit - prediction$se.fit * 1.96
prediction$upper <- prediction$fit + prediction$se.fit * 1.96

Finally, I transform the fit and confidence intervals from log odds into probabilities:

prediction <- lapply(prediction, function(logodds) exp(logodds)/(1 + exp(logodds)))

plotdf <- data.frame(Group = c("A", "B"), fit = prediction$fit,
                     upper = prediction$upper, lower = prediction$lower)
plotdf
#>   Group  fit     upper      lower
#> 1     A 0.13 0.2111260 0.07700412
#> 2     B 0.92 0.9594884 0.84811360
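The same back-transformation can also be written with base R's `plogis()` (the inverse logit) and `qnorm(0.975)` in place of the rounded 1.96. This is only a sketch that rebuilds the simulated data and model so it runs on its own; the names `pred_link`, `z` and `ci` are chosen for illustration. Building the interval on the log-odds scale and transforming afterwards keeps both limits between 0 and 1, which is not guaranteed if you add and subtract standard errors directly on the probability scale.

```r
# Same simulated data and model as in the answer
set.seed(100)
df  <- data.frame(outcome = rbinom(200, 1, c(0.1, 0.9)),
                  var1    = rep(c("A", "B"), 100))
mod <- glm(outcome ~ var1, data = df, family = binomial)

# Predictions and standard errors on the link (log-odds) scale
pred_link <- predict(mod, newdata = data.frame(var1 = c("A", "B")),
                     se.fit = TRUE, type = "link")

z  <- qnorm(0.975)  # normal quantile for a two-sided 95% interval
ci <- data.frame(
  Group = c("A", "B"),
  fit   = plogis(pred_link$fit),
  lower = plogis(pred_link$fit - z * pred_link$se.fit),
  upper = plogis(pred_link$fit + z * pred_link$se.fit)
)
ci
```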

Now I am ready to plot. I will use geom_point for the probability estimates and geom_errorbar for the confidence intervals:

library(ggplot2)

ggplot(plotdf, aes(x = Group, y = fit, colour = Group)) +
  geom_errorbar(aes(ymin = lower, ymax = upper), size = 2, width = 0.5) +
  geom_point(size = 3, colour = "black") + 
  scale_y_continuous(limits = c(0, 1)) + 
  labs(title = "Probability estimate with 95% CI", y = "Probability")

Created on 2020-05-11 by the reprex package (v0.3.0)
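Applied to the question's own variables, the same steps might look like the sketch below. The real `decision_use` was not posted, so the data here are a simulated stand-in (the three decision levels and their probabilities are invented); only the column names `use` and `decision` and the model formula come from the question. The name `newdat` is just for illustration.

```r
library(ggplot2)

# Simulated stand-in for the asker's data; levels and probabilities are made up
set.seed(42)
decision_use <- data.frame(decision = sample(1:3, 300, replace = TRUE))
decision_use$use <- rbinom(300, 1, c(0.3, 0.4, 0.5)[decision_use$decision])

glm0 <- glm(use ~ as.factor(decision), data = decision_use,
            family = binomial(link = "logit"))

# One row per decision level, rather than one row per observation
newdat <- data.frame(decision = sort(unique(decision_use$decision)))
pred   <- predict(glm0, newdata = newdat, type = "link", se.fit = TRUE)

# 95% interval on the log-odds scale, then back-transformed to probabilities
newdat$fit   <- plogis(pred$fit)
newdat$lower <- plogis(pred$fit - 1.96 * pred$se.fit)
newdat$upper <- plogis(pred$fit + 1.96 * pred$se.fit)

ggplot(newdat, aes(x = factor(decision), y = fit)) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.3) +
  geom_point(size = 3) +
  scale_y_continuous(limits = c(0, 1)) +
  labs(x = "decision", y = "Predicted probability of use")
```

Predicting on one row per level, instead of on the full `decision_use` data frame, gives exactly one estimate and one interval per category, which is what the error bars need.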

Upvotes: 1
