Reputation: 2170
I have a binomial glm
of a presence/absence response variable and a factor variable with 9 levels like this:
data$y<-factor(data$y,levels=c(0,1),labels=c("absent","present"))
table(data$y,data$site_name)
Andulay Antulang Basak Dauin Poblacion District 1 Guinsuan Kookoo's Nest Lutoban Pier Lutoban South Malatapay Pier
absent 4 4 1 0 3 1 5 5 2
present 2 2 5 6 3 5 1 1 4
model <- glm(y~site_name,data=data,binomial)
Just skipping the model inference and validation for brevity's sake, how do I plot per site a probability of getting "present" in a boxplot with its confidence interval? What I would like is kind of what is shown in Plot predicted probabilities and confidence intervals in R but I would like to show it with a boxplot, as my regression variable site_name is a factor with 9 levels, not a continuous variable.
I think I can calculate the necessary values as follows (but am not 100% sure about the correctness):
Function to convert the model coefficients back to probabilities of success:
calc_val <- function(x){return(round(1/(1+1/(exp(x))),3))}
Predicted probabilities based on the model:
prob <- tapply(predict(model,type="response"),data$site_name,function(x){round(mean(x),3)})
means <- as.data.frame(prob)
75% and 95% confidence intervals for the predicted probabilities:
ci <- cbind(confint(model,level=0.9),confint(model,level=0.5))
rownames(ci) <- gsub("site_name","",rownames(ci))
ci <- t(apply(ci,1,calc_val))
Join it all together in one table
ci<-cbind(means,ci)
ci
prob 5 % 95 % 25 % 75 % Pr(>|z|) stderr
Andulay 0.333 0.091 0.663 0.214 0.469 0.42349216 0.192
Antulang 0.333 0.112 0.888 0.304 0.696 1.00000000 0.192
Basak 0.833 0.548 0.993 0.802 0.964 0.09916496 0.152
Dauin Poblacion District 1 1.000 0.000 NA 0.000 1.000 0.99097988 0.000
Guinsuan 0.500 0.223 0.940 0.474 0.819 0.56032414 0.204
Kookoo's Nest 0.833 0.548 0.993 0.802 0.964 0.09916496 0.152
Lutoban Pier 0.167 0.028 0.788 0.130 0.501 0.51171512 0.152
Lutoban South 0.167 0.028 0.788 0.130 0.501 0.51171512 0.152
Malatapay Pier 0.667 0.364 0.972 0.640 0.903 0.25767454 0.192
So my questions are twofold:
EDIT Here is some sample data via dput
(which also modified the tables above to match the data):
# dput(data[c("y", "site_name")])
data <- structure(list(y = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("absent", "present"), class = "factor"), site_name = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 9L, 9L, 9L, 9L, 9L, 9L, 4L, 4L, 4L, 4L, 4L, 4L, 8L, 8L, 8L, 8L, 8L, 8L, 7L, 7L, 7L, 7L, 7L, 7L, 5L, 5L, 5L, 5L, 5L, 5L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 6L, 6L, 6L, 6L, 6L, 6L), .Label = c("Andulay", "Antulang", "Basak", "Dauin Poblacion District 1", "Guinsuan", "Kookoo's Nest", "Lutoban Pier", "Lutoban South", "Malatapay Pier"), class = "factor")), .Names = c("y", "site_name"), row.names = c(125L, 123L, 126L, 124L, 128L, 127L, 154L, 159L, 157L, 158L, 156L, 155L, 111L, 114L, 116L, 115L, 112L, 113L, 152L, 151L, 148L, 150L, 153L, 149L, 143L, 146L, 144L, 147L, 142L, 145L, 164L, 165L, 161L, 163L, 160L, 162L, 120L, 122L, 121L, 117L, 118L, 119L, 137L, 136L, 139L, 141L, 140L, 138L, 129L, 134L, 131L, 135L, 133L, 130L), class = "data.frame")
#
Upvotes: 3
Views: 7461
Reputation: 226057
This is a lowest-common-denominator, base-package-only, solution.
Fit the model:
mm <- glm(y~site_name,data=dd,family=binomial)
Make up a prediction frame with the site names:
pframe <- data.frame(site_name=unique(dd$site_name))
Predict (on the logit/linear-predictor scale), with standard errors
pp <- predict(mm,newdata=pframe,se.fit=TRUE)
linkinv <- family(mm)$linkinv ## inverse-link function
Put together the prediction, lower and upper bounds, and back-transform to the probability scale:
pframe$pred0 <- pp$fit
pframe$pred <- linkinv(pp$fit)
alpha <- 0.95
sc <- abs(qnorm((1-alpha)/2)) ## Normal approx. to likelihood
alpha2 <- 0.5
sc2 <- abs(qnorm((1-alpha2)/2)) ## Normal approx. to likelihood
pframe <- transform(pframe,
lwr=linkinv(pred0-sc*pp$se.fit),
upr=linkinv(pred0+sc*pp$se.fit),
lwr2=linkinv(pred0-sc2*pp$se.fit),
upr2=linkinv(pred0+sc2*pp$se.fit))
Plot.
with(pframe,
{
plot(site_name,pred,ylim=c(0,1))
arrows(as.numeric(site_name),lwr,as.numeric(site_name),upr,
angle=90,code=3,length=0.1)
})
As a boxplot:
with(pframe,
{
bxp(list(stats=rbind(lwr,lwr2,pred,upr2,upr),
n = rep(1,nrow(pframe)),
conf = NA,
out = NULL,
group = NULL,
names=as.character(site_name)))
})
There are lots of other ways to do this; I would recommend
library("ggplot2")
ggplot(pframe,aes(site_name,pred))+
geom_pointrange(aes(ymin=lwr,ymax=upr))+
geom_linerange(aes(ymin=lwr2,ymax=upr2),lwd=1.5)+
coord_flip()
An alternative solution is to fit the model via y~site_name-1
, which in this case will assign a separate parameter to the probability of each site, and use profile()
/confint()
to find the confidence intervals; this will be slightly more accurate than relying on the Normality of the sampling distributions of the parameters/predictions as done in the answer above.
Upvotes: 7