Reputation: 2186
I am trying to understand the difference between two different fitting methods for a data set with a bounded response variable. The response variable is a fraction and therefore has a range of [0,1]. I have uncovered through my Google searching that there are a lot of different methods out there as this is a common operation. I am currently interested in the difference between the stock R GLM fit and the Beta regression offered in the betareg package. I am using the GasolineYield data set from the "betareg" package as my sample data set. Before I post the code and the results my two questions are the following:
Am I performing the Logistic Regression fit in R using the builtin R GLM correctly?
Why are the standard errors reported in the Beta regression so much smaller than the standard errors for the R logistic regression?
R Setup Code
library(betareg)
data("GasolineYield", package = "betareg")
Beta Regression code from the "betareg" package
gy = betareg(yield ~ batch + temp, data = GasolineYield)
summary(gy)
Beta Regression summary output
Call:
betareg(formula = yield ~ batch + temp, data = GasolineYield)
Standardized weighted residuals 2:
Min 1Q Median 3Q Max
-2.8750 -0.8149 0.1601 0.8384 2.0483
Coefficients (mean model with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.1595710 0.1823247 -33.784 < 2e-16 ***
batch1 1.7277289 0.1012294 17.067 < 2e-16 ***
batch2 1.3225969 0.1179020 11.218 < 2e-16 ***
batch3 1.5723099 0.1161045 13.542 < 2e-16 ***
batch4 1.0597141 0.1023598 10.353 < 2e-16 ***
batch5 1.1337518 0.1035232 10.952 < 2e-16 ***
batch6 1.0401618 0.1060365 9.809 < 2e-16 ***
batch7 0.5436922 0.1091275 4.982 6.29e-07 ***
batch8 0.4959007 0.1089257 4.553 5.30e-06 ***
batch9 0.3857930 0.1185933 3.253 0.00114 **
temp 0.0109669 0.0004126 26.577 < 2e-16 ***
Phi coefficients (precision model with identity link):
Estimate Std. Error z value Pr(>|z|)
(phi) 440.3 110.0 4.002 6.29e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Type of estimator: ML (maximum likelihood)
Log-likelihood: 84.8 on 12 Df
Pseudo R-squared: 0.9617
Number of iterations: 51 (BFGS) + 3 (Fisher scoring)
R GLM Logistic Regression code from stock R
glmfit = glm(yield ~ batch + temp, data = GasolineYield, family = "binomial")
summary(glmfit)
R GLM Logistic Regression summary output
Call:
glm(formula = yield ~ batch + temp, family = "binomial", data = GasolineYield)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.100459 -0.025272 0.004217 0.032879 0.082113
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.130227 3.831798 -1.600 0.110
batch1 1.720311 2.127205 0.809 0.419
batch2 1.305746 2.481266 0.526 0.599
batch3 1.562343 2.440712 0.640 0.522
batch4 1.048928 2.152385 0.487 0.626
batch5 1.125075 2.176242 0.517 0.605
batch6 1.029601 2.229773 0.462 0.644
batch7 0.540401 2.294474 0.236 0.814
batch8 0.497355 2.288564 0.217 0.828
batch9 0.378315 2.494881 0.152 0.879
temp 0.010906 0.008676 1.257 0.209
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2.34184 on 31 degrees of freedom
Residual deviance: 0.07046 on 21 degrees of freedom
AIC: 36.631
Number of Fisher Scoring iterations: 5
Upvotes: 2
Views: 1303
Reputation: 8252
The standard errors are different because the variance assumptions in the two models are different.
Logistic regression assumes the response has a binomial distribution, while beta regression assumes it has a beta distribution.
The variance functions of the two are different. For the binomial, if you specify the mean (and $n$ is a given) the variance is determined. For the beta there's another free parameter, so it isn't determined by the mean and would presumably be estimated from the data.
This suggests that if you fit a quasibinomial GLM (adding a variance parameter) you might get closer to the same standard errors, but they still won't be the same, since they would weight the observations differently.
What you should actually do:
if your proportions are originally counts divided by some total count, then a binomial GLM would be an appropriate model to consider. (You would need the total counts, though.)
if your proportions are continuous fractions (the proportion of milk that's cream for example), then beta regression is an appropriate model to consider.
Upvotes: 2