Seth Kasten

Reputation: 63

Logistic regression from R returning values greater than one

I have run a logistic regression in R using glm to predict the likelihood that an individual in 1993 will have arthritis in 2004 (Arth2004) based on gender (Gen), smoking status (Smoke1993), hypertension (HT1993), high cholesterol (HC1993), and BMI (BMI1993) status in 1993. My sample size is n=7896. All variables are binary with 0 and 1 for false and true except BMI, which is continuous numeric. For gender, male=1 and female=0.

When I run the regression in R, I get good p-values, but when I actually use the regression for prediction, I get values greater than one quite often for very standard individuals. I apologize for the large code block, but I thought more information may be helpful.

library(ResourceSelection)
library(MASS)
data=read.csv(file.choose())
data$Arth2004 = as.factor(data$Arth2004)
data$Gen = as.factor(data$Gen)
data$Smoke1993 = as.factor(data$Smoke1993)
data$HT1993 = as.factor(data$HT1993)
data$HC1993 = as.factor(data$HC1993)
data$BMI1993 = as.numeric(data$BMI1993)

logistic <- glm(Arth2004 ~ Gen + Smoke1993 + BMI1993 + HC1993 + HT1993, data=data, family="binomial")

summary(logistic)

hoslem.test(logistic$y, fitted(logistic))

confint(logistic)

min(data$BMI1993)
median(data$BMI1993)
max(data$BMI1993)

e=2.71828

The output is as follows:

Call:
glm(formula = Arth2004 ~ Gen + Smoke1993 + BMI1993 + HC1993 + 
    HT1993, family = "binomial", data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0362  -1.0513  -0.7831   1.1844   1.8807  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.346104   0.158043 -14.845  < 2e-16 ***
Gen1        -0.748286   0.048398 -15.461  < 2e-16 ***
Smoke19931  -0.059342   0.064606  -0.919    0.358    
BMI1993      0.084056   0.006005  13.997  < 2e-16 ***
HC19931      0.388217   0.047820   8.118 4.72e-16 ***
HT19931      0.341375   0.058423   5.843 5.12e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 10890  on 7895  degrees of freedom
Residual deviance: 10309  on 7890  degrees of freedom
AIC: 10321

Number of Fisher Scoring iterations: 4

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  logistic$y, fitted(logistic)
X-squared = 18.293, df = 8, p-value = 0.01913

Waiting for profiling to be done...
                  2.5 %      97.5 %
(Intercept) -2.65715966 -2.03756775
Gen1        -0.84336906 -0.65364134
Smoke19931  -0.18619647  0.06709748
BMI1993      0.07233866  0.09588198
HC19931      0.29454661  0.48200673
HT19931      0.22690608  0.45595006

[1] 18
[1] 26
[1] 43

A non-smoking female w/ median BMI (26), hypertension, and high cholesterol yields the following:

e^(26*0.084056+1*0.388217+1*0.341375-0*0.748286-0*0.059342-2.346104)

[1] 1.7664

I think the issue is related somehow to BMI considering that is the only variable that is numeric. Does anyone know why this regression produces probabilities greater than 1?

Upvotes: 1

Views: 1842

Answers (2)

Francis

Reputation: 11

What you have computed is the odds, not the probability. The logistic function has this formula:

$$p(X) = \frac{e^{\beta_0+\beta_1X_1+\cdots+\beta_pX_p}}{1+e^{\beta_0+\beta_1X_1+\cdots+\beta_pX_p}}$$


The odds are: $$\frac{p(X)}{1-p(X)} = e^{\beta_0+\beta_1X_1+\cdots+\beta_pX_p}$$


Here p(X) is the probability of the outcome given the predictors X. For example, if a coin lands heads with probability p = 0.75, the odds of heads are 0.75/(1-0.75) = 3: heads is 3 times as likely as tails (p = 0.25).
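As a small illustration of moving between the two scales, here is a sketch in R. The helper names `prob_to_odds` and `odds_to_prob` are made up for this example and do not come from any package.

```r
# Hypothetical helper names, introduced here only for illustration.
prob_to_odds <- function(p) p / (1 - p)           # probability -> odds
odds_to_prob <- function(odds) odds / (1 + odds)  # odds -> probability

coin_odds <- prob_to_odds(0.75)    # odds of heads for the coin example
coin_prob <- odds_to_prob(3)       # back to the probability of heads

# Odds can exceed 1, but the probability they map to never does.
q_prob <- odds_to_prob(1.7664)     # the value from the question, roughly 0.64
```

Any positive odds value maps back to a probability strictly between 0 and 1, which is why a result above 1 indicates the odds scale rather than a broken model.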

In your case, p(X) = 1.7664/(1+1.7664) ≈ 0.64.
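The same arithmetic can be checked in R without the hand-typed constant for e. This is a sketch that copies the coefficient estimates from the `summary(logistic)` output in the question. The vector names `b` and `x` are chosen here for illustration, and `plogis()` is base R's logistic function.

```r
# Coefficient estimates copied from the summary(logistic) output in the question.
b <- c(intercept = -2.346104, Gen1 = -0.748286, Smoke19931 = -0.059342,
       BMI1993 = 0.084056, HC19931 = 0.388217, HT19931 = 0.341375)

# Non-smoking female with BMI 26, high cholesterol, and hypertension.
x <- c(intercept = 1, Gen1 = 0, Smoke19931 = 0,
       BMI1993 = 26, HC19931 = 1, HT19931 = 1)

log_odds <- sum(b * x)        # linear predictor, on the log-odds scale
odds     <- exp(log_odds)     # what the question computed, about 1.7664
prob     <- plogis(log_odds)  # equals odds / (1 + odds), about 0.64
```

Using `exp()` and `plogis()` avoids rounding error from typing `e = 2.71828` by hand, although at this precision the difference is negligible.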

Upvotes: 1

Stéphane Laurent

Reputation: 84539

By default, family = "binomial" uses the logit link function (see ?family). So the probability you're looking for is 1.7664 / (1+1.7664).
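In practice, `predict()` can do this conversion for you. The sketch below assumes simulated stand-in data, because the asker's CSV is not available. The data frame `sim`, the coefficients used to generate it, and the name `new_person` are all made up for illustration. The point is only that `type = "link"` returns log-odds while `type = "response"` returns probabilities.

```r
# Simulated stand-in data (an assumption; not the asker's data set).
set.seed(1)
n <- 500
sim <- data.frame(
  Gen     = factor(rbinom(n, 1, 0.5)),
  BMI1993 = round(rnorm(n, mean = 26, sd = 4))
)
# Outcome drawn from an arbitrary logistic model, for illustration only.
eta <- -2.3 - 0.7 * (sim$Gen == "1") + 0.08 * sim$BMI1993
sim$Arth2004 <- factor(rbinom(n, 1, plogis(eta)))

fit <- glm(Arth2004 ~ Gen + BMI1993, data = sim, family = "binomial")

link_name <- binomial()$link  # "logit" is the default link for this family

# A female (Gen = 0) with BMI 26; factor levels must match the training data.
new_person <- data.frame(Gen = factor("0", levels = levels(sim$Gen)),
                         BMI1993 = 26)

log_odds <- predict(fit, newdata = new_person, type = "link")      # log-odds
prob     <- predict(fit, newdata = new_person, type = "response")  # probability
```

With the original model, the equivalent call would be `predict(logistic, newdata = ..., type = "response")`, which always returns values between 0 and 1.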

Upvotes: 2
