Wandering_geek
Wandering_geek

Reputation: 327

Extract confidence interval for both values of binary variable for glm()?

I want to analyze the relation between whether someone smoked or not and the number of drinks of alcohol.

The reproducible data set:

smoking_status alcohol_drinks
1 2
0 5
1 2
0 1
1 0
1 0
0 0
1 9
1 6
1 5

I have used glm() to analyse this relation:

glm <- glm(smoking_status ~ alcohol_drinks, data = data, family = binomial)
summary(glm)
confint(glm)

Using the above I'm able to extract the p-value and the confidence interval for the entire set.

However, I would like to extract the confidence interval for each smoking status, so that I can produce this results table:

Alcohol drinks, mean (95%CI) p-values
Smokers X (X - X) 0.492
Non-smokers X (X - X)

How can I produce this?

Upvotes: 1

Views: 650

Answers (1)

Rui Barradas
Rui Barradas

Reputation: 76412

First of all, the response alcohol_drinks is not binary, a logistic regression is out of the question. Since the response is counts data, I will fit a Poisson model.

To have confidence intervals for each binary value of smoking_status, coerce to factor and fit a model without an intercept.

x <- 'smoking_status  alcohol_drinks
1   2
0   5
1   2
0   1
1   0
1   0
0   0
1   9
1   6
1   5'
df1 <- read.table(textConnection(x), header = TRUE)


pois_fit <- glm(alcohol_drinks ~ 0 + factor(smoking_status), data = df1, family = poisson(link = "log"))
summary(pois_fit)
#> 
#> Call:
#> glm(formula = alcohol_drinks ~ 0 + factor(smoking_status), family = poisson(link = "log"), 
#>     data = df1)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.6186  -1.7093  -0.8104   1.1389   2.4957  
#> 
#> Coefficients:
#>                         Estimate Std. Error z value Pr(>|z|)    
#> factor(smoking_status)0   0.6931     0.4082   1.698   0.0895 .  
#> factor(smoking_status)1   1.2321     0.2041   6.036 1.58e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for poisson family taken to be 1)
#> 
#>     Null deviance: 58.785  on 10  degrees of freedom
#> Residual deviance: 31.324  on  8  degrees of freedom
#> AIC: 57.224
#> 
#> Number of Fisher Scoring iterations: 5
confint(pois_fit)
#> Waiting for profiling to be done...
#>                              2.5 %   97.5 %
#> factor(smoking_status)0 -0.2295933 1.399304
#> factor(smoking_status)1  0.8034829 1.607200
#>
exp(confint(pois_fit))
#> Waiting for profiling to be done...
#>                             2.5 %   97.5 %
#> factor(smoking_status)0 0.7948568 4.052378
#> factor(smoking_status)1 2.2333058 4.988822

Created on 2022-06-04 by the reprex package (v2.0.1)


Edit

The edit to the question states that the problem was reversed, what is asked is to find out the effect of alcohol drinking on smoking status. And with a binary response, individuals can be smokers or not, a logistic regression is a possible model.

bin_fit <- glm(smoking_status ~ alcohol_drinks, data = df1, family = binomial(link = "logit"))
summary(bin_fit)
#> 
#> Call:
#> glm(formula = smoking_status ~ alcohol_drinks, family = binomial(link = "logit"), 
#>     data = df1)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.7491  -0.8722   0.6705   0.8896   1.0339  
#> 
#> Coefficients:
#>                Estimate Std. Error z value Pr(>|z|)
#> (Intercept)      0.3474     0.9513   0.365    0.715
#> alcohol_drinks   0.1877     0.2730   0.687    0.492
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 12.217  on 9  degrees of freedom
#> Residual deviance: 11.682  on 8  degrees of freedom
#> AIC: 15.682
#> 
#> Number of Fisher Scoring iterations: 4

# Odds ratios
exp(coef(bin_fit))
#>    (Intercept) alcohol_drinks 
#>       1.415412       1.206413
exp(confint(bin_fit))
#> Waiting for profiling to be done...
#>                    2.5 %    97.5 %
#> (Intercept)    0.2146432 11.167555
#> alcohol_drinks 0.7464740  2.417211

Created on 2022-06-05 by the reprex package (v2.0.1)


Another way to conduct a logistic regression is to regress the cumulative counts of smokers on increasing numbers of alcoholic drinks. In order to do this, the data must be sorted by alcohol_drinks, so I will create a second data set, df2. Code inspired this in this RPubs post.

df2 <- df1[order(df1$alcohol_drinks), ]
Total <- sum(df2$smoking_status)
df2$smoking_status <- cumsum(df2$smoking_status)

fit <- glm(cbind(smoking_status, Total - smoking_status) ~ alcohol_drinks, data = df2, family = binomial())
summary(fit)
#> 
#> Call:
#> glm(formula = cbind(smoking_status, Total - smoking_status) ~ 
#>     alcohol_drinks, family = binomial(), data = df2)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -0.9714  -0.2152   0.1369   0.2942   0.8975  
#> 
#> Coefficients:
#>                Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)     -1.1671     0.3988  -2.927 0.003428 ** 
#> alcohol_drinks   0.4437     0.1168   3.798 0.000146 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 23.3150  on 9  degrees of freedom
#> Residual deviance:  3.0294  on 8  degrees of freedom
#> AIC: 27.226
#> 
#> Number of Fisher Scoring iterations: 4
# Odds ratios
exp(coef(fit))
#>    (Intercept) alcohol_drinks 
#>      0.3112572      1.5584905
exp(confint(fit))
#> Waiting for profiling to be done...
#>                    2.5 %    97.5 %
#> (Intercept)    0.1355188 0.6569898
#> alcohol_drinks 1.2629254 2.0053079

plot(smoking_status/Total ~ alcohol_drinks, 
     data = df2,
     xlab = "Alcoholic Drinks",
     ylab = "Proportion of Smokers")
lines(df2$alcohol_drinks, fit$fitted, type="l", col="red")
title(main = "Alcohol and Smoking")

Created on 2022-06-05 by the reprex package (v2.0.1)

Upvotes: 2

Related Questions