CDoug
CDoug

Reputation: 145

Differences between glm predictions and geom_smooth() using same glm

I am trying to reproduce someone else's work using a probit model. Unfortunately, I don't have much information about their methods, only their starting data and a plot of their model.

When I plot the data in ggplot and fit a line using geom_smooth(method = "glm", ...), I am able to reproduce the prior work. However, when I try to fit (what I think is) an identical model outside of ggplot using glm(), I get different predictions. I feel like I making some silly mistake, but I can't quite pin it down.

Here is a reproducible example:

library(tidyverse)
set.seed(123)

df <- tibble(x = c(0.006, 0.014, 0.025, 0.05, 0.15, 0.3, 0.5),
             y = c(0.4,   0.733, 0.875, 1,    1,    1,   1))


probit_model <- glm(y ~ x, data = df, family = quasibinomial(link = "probit"))

df <- df %>%
  add_row(x = 0.001, y = NA) %>%  # To underline that these models are different
  mutate(y_pred = predict(probit_model, newdata = ., type = "response"))


df %>%
  ggplot(aes(x, y)) +
  geom_point(size = 4) +
  geom_line(aes(y = y_pred), color = "red", lwd = 1) +
  geom_smooth(formula = y ~ x, color = "blue",
              method = "glm", fullrange = TRUE,
              method.args = list(family = quasibinomial(link = "probit"))) +
  scale_x_log10(limits = c(0.001, 1))

And here is the plot it produces. Note that the blue line and the red line describe different fits. I believe they should be the same (ignoring the piecewise nature of the red line), given they use the same model and data.

Plot

I've read quite a few threads in the process of troubleshooting, and many responses suggest that geom_smooth() is not a replacement for modelling. Broadly, I agree. That said, I am explicitly trying to figure out what geom_smooth() is doing here, and then reproduce it outside of ggplot.

My questions are:

Why are these two models different? How is geom_smooth() calling glm()? How can I call glm() myself to reproduce the model that geom_smooth() is using?

Upvotes: 3

Views: 1539

Answers (1)

eipi10
eipi10

Reputation: 93821

The models are actually the same. You can see this if you set, say, xlim(0, 0.1) and remove scale_x_log10. Then you'll see the fits coincide.

I think the behavior you're seeing is because scale_x_log10 performs the axis transformation before any statistical summaries (such as geom_smooth). So, when you run scale_x_log10, geom_smooth is effectively fitting the model y ~ log10(x), rather than y ~ x. If you use coord_trans(x="log10") instead of scale_x_log10, you'll also see that the models coincide, since coord_trans does the transformation after any statistical summaries.

Upvotes: 3

Related Questions