Reputation: 11
I want to do a regression with count data model, where doctor visits is the dependent variable. I did a two-part model with first a probit model for no doctor visit at all or one or more and then a Poisson model for at least one doctor visit. After that, I did a hurdle model as a robustness check, because as far as I know, I should get very similar values for both approaches. I do get nearly the same values for the probit part. I get, however, very different values for the Poisson part. Does anyone have any idea why? Here are the commands I used:
probit_doc <- glm(docbin ~ phi + gender + age + health + educ + smoke + logthinc +
wave + AUS + GER + SWE + NED + ESP + ITA + FRA + DEN + GRE +
SWI + BEL + ISR + CZE + POL + LUX + HUN + POR + SVN + EST +
CRO + LIT + BUL + CYP + FIN + LVA + MAL + ROM,
data=allwaves, family=binomial(link="probit"))
poisson_doc <- glm(I(doc > 0) ~ phi + gender + age + health + educ + smoke +
logthinc + wave + AUS + GER + SWE + NED + ESP + ITA + FRA +
DEN + GRE + SWI + BEL + ISR + CZE + POL + LUX + HUN + POR +
SVN + EST + CRO + LIT + BUL + CYP + FIN + LVA + MAL + ROM,
data=allwaves, family="poisson")
hd_doc <- hurdle(doc ~ phi + gender + age + educ + smoke + logthinc + wave +
AUS + GER + SWE + NED + ESP + ITA + FRA + DEN + GRE + SWI +
BEL + ISR + CZE + POL + LUX + HUN + POR + SVN + EST + CRO +
LIT + BUL + CYP + FIN + LVA + MAL + ROM | phi + gender +
age + health + educ + smoke + logthinc + wave + AUS + GER +
SWE + NED + ESP + ITA + FRA + DEN + GRE + SWI + BEL + ISR +
CZE + POL + LUX + HUN + POR + SVN + EST + CRO + LIT + BUL +
CYP + FIN + LVA + MAL + ROM,
dist="poisson", data=allwaves, zero.dist="binomial", link="probit")
Upvotes: 1
Views: 469
Reputation: 72758
In a hurdle model we usually calculate a zero part using a binomial
model for the "hurdle", and a count part for what happens after the "hurdle" was passed. Accordingly, in the zero part we binarize the outcome on if it is unequal to zero, as you did correctly.
In the count part however, only the observations that have passed the hurdle are taken into account. So rather than manipulating the outcome variable as we did in the zero model, we want to subset
the data to observations that have values unequal to zero.
Consider this example from the pscl
package which you are obviously using.
library(pscl)
hurdle(art ~ ., dist="poisson", zero.dist="binomial", link="probit", data=bioChemists)$coe
# $count
# (Intercept) femWomen marMarried kid5 phd ment
# 0.67113931 -0.22858266 0.09648499 -0.14218756 -0.01272637 0.01874548
#
# $zero
# (Intercept) femWomen marMarried kid5 phd ment
# 0.15420081 -0.14616546 0.19834717 -0.17380191 0.01864404 0.04433777
We may replicate the zero part by binarizing the outcome variable I(art > 0)
.
## zero model
glm(I(art > 0) ~ ., family=binomial(link='probit'), bioChemists)$coe
# (Intercept) femWomen marMarried kid5 phd ment
# 0.15420081 -0.14616546 0.19834717 -0.17380191 0.01864405 0.04433781
For the count part, if we binarize the outcome variable I(art > 0)
we get wrong results.
## count model misspecified
glm(I(art > 0) ~ ., family=poisson(), bioChemists)$coe
# (Intercept) femWomen marMarried kid5 phd ment
# -0.52364397 -0.07020092 0.08944930 -0.08814266 0.01987790 0.01258345
Instead, we want to subset
the data on observations where the outcome is non-zero, and the values already better resemble those of the hurdle
.
## count model correctly specified
glm(art ~ ., family=poisson(), subset(bioChemists, art > 0))$coe
# (Intercept) femWomen marMarried kid5 phd ment
# 0.82722135 -0.16220935 0.06824070 -0.09902318 -0.01311912 0.01504273
However, as you actually stated in your title, hurdle
is calculating a truncated poisson model, which is also known as positive poisson distribution whereas you used a conventional poisson in the glm
.
The VGAM
package provides a pospoisson
family function which gives exactly what we want.
VGAM::vglm(art ~ ., family=VGAM::pospoisson(), subset(bioChemists, art > 0)) |>
coefficients()
# (Intercept) femWomen marMarried kid5 phd ment
# 0.67113934 -0.22858262 0.09648498 -0.14218724 -0.01272657 0.01874550
The visual comparison of the two distributions shows that positive poisson is indeed better suited to fit the count part of a hurdle model.
Upvotes: 1