Reputation: 145
I am trying to reproduce someone else's work using a probit model. Unfortunately, I don't have much information about their methods, only their starting data and a plot of their model.
When I plot the data in ggplot and fit a line using geom_smooth(method = "glm", ...)
, I am able to reproduce the prior work. However, when I try to fit (what I think is) an identical model outside of ggplot using glm()
, I get different predictions. I feel like I making some silly mistake, but I can't quite pin it down.
Here is a reproducible example:
library(tidyverse)
set.seed(123)
df <- tibble(x = c(0.006, 0.014, 0.025, 0.05, 0.15, 0.3, 0.5),
y = c(0.4, 0.733, 0.875, 1, 1, 1, 1))
probit_model <- glm(y ~ x, data = df, family = quasibinomial(link = "probit"))
df <- df %>%
add_row(x = 0.001, y = NA) %>% # To underline that these models are different
mutate(y_pred = predict(probit_model, newdata = ., type = "response"))
df %>%
ggplot(aes(x, y)) +
geom_point(size = 4) +
geom_line(aes(y = y_pred), color = "red", lwd = 1) +
geom_smooth(formula = y ~ x, color = "blue",
method = "glm", fullrange = TRUE,
method.args = list(family = quasibinomial(link = "probit"))) +
scale_x_log10(limits = c(0.001, 1))
And here is the plot it produces. Note that the blue line and the red line describe different fits. I believe they should be the same (ignoring the piecewise nature of the red line), given they use the same model and data.
I've read quite a few threads in the process of troubleshooting, and many responses suggest that geom_smooth()
is not a replacement for modelling. Broadly, I agree. That said, I am explicitly trying to figure out what geom_smooth()
is doing here, and then reproduce it outside of ggplot.
My questions are:
Why are these two models different? How is geom_smooth()
calling glm()
? How can I call glm()
myself to reproduce the model that geom_smooth()
is using?
Upvotes: 3
Views: 1539
Reputation: 93821
The models are actually the same. You can see this if you set, say, xlim(0, 0.1)
and remove scale_x_log10
. Then you'll see the fits coincide.
I think the behavior you're seeing is because scale_x_log10
performs the axis transformation before any statistical summaries (such as geom_smooth
). So, when you run scale_x_log10
, geom_smooth
is effectively fitting the model y ~ log10(x)
, rather than y ~ x
. If you use coord_trans(x="log10")
instead of scale_x_log10
, you'll also see that the models coincide, since coord_trans
does the transformation after any statistical summaries.
Upvotes: 3