Reputation: 63
I have run a logistic regression in R using glm to predict the likelihood that an individual in 1993 will have arthritis in 2004 (Arth2004) based on gender (Gen), smoking status (Smoke1993), hypertension (HT1993), high cholesterol (HC1993), and BMI (BMI1993) status in 1993. My sample size is n=7896. All variables are binary with 0 and 1 for false and true except BMI, which is continuous numeric. For gender, male=1 and female=0.
When I run the regression in R, I get good p-values, but when I actually use the regression for prediction, I get values greater than one quite often for very standard individuals. I apologize for the large code block, but I thought more information may be helpful.
library(ResourceSelection)
library(MASS)
data=read.csv(file.choose())
data$Arth2004 = as.factor(data$Arth2004)
data$Gen = as.factor(data$Gen)
data$Smoke1993 = as.factor(data$Smoke1993)
data$HT1993 = as.factor(data$HT1993)
data$HC1993 = as.factor(data$HC1993)
data$BMI1993 = as.numeric(data$BMI1993)
logistic <- glm(Arth2004 ~ Gen + Smoke1993 + BMI1993 + HC1993 + HT1993, data=data, family="binomial")
summary(logistic)
hoslem.test(logistic$y, fitted(logistic))
confint(logistic)
min(data$BMI1993)
median(data$BMI1993)
max(data$BMI1993)
e=2.71828
The output is as follows:
Call:
glm(formula = Arth2004 ~ Gen + Smoke1993 + BMI1993 + HC1993 +
HT1993, family = "binomial", data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0362 -1.0513 -0.7831 1.1844 1.8807
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.346104 0.158043 -14.845 < 2e-16 ***
Gen1 -0.748286 0.048398 -15.461 < 2e-16 ***
Smoke19931 -0.059342 0.064606 -0.919 0.358
BMI1993 0.084056 0.006005 13.997 < 2e-16 ***
HC19931 0.388217 0.047820 8.118 4.72e-16 ***
HT19931 0.341375 0.058423 5.843 5.12e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10890 on 7895 degrees of freedom
Residual deviance: 10309 on 7890 degrees of freedom
AIC: 10321
Number of Fisher Scoring iterations: 4
Hosmer and Lemeshow goodness of fit (GOF) test
data: logistic$y, fitted(logistic)
X-squared = 18.293, df = 8, p-value = 0.01913
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -2.65715966 -2.03756775
Gen1 -0.84336906 -0.65364134
Smoke19931 -0.18619647 0.06709748
BMI1993 0.07233866 0.09588198
HC19931 0.29454661 0.48200673
HT19931 0.22690608 0.45595006
[1] 18
[1] 26
[1] 43
A non-smoking female w/ median BMI (26), hypertension, and high cholesterol yields the following:
e^(26*0.084056+1*0.388217+1*0.341375-0*0.748286-0*0.059342-2.346104)
[1] 1.7664
I think the issue is related somehow to BMI considering that is the only variable that is numeric. Does anyone know why this regression produces probabilities greater than 1?
Upvotes: 1
Views: 1842
Reputation: 11
What you have computed is the odds-ratio. The logistic function has this formula:
$$p(X) = \frac{e^{\beta_0+\beta_1X_1+...\beta_pX_p}}{1+e^{\beta_0+\beta_1X_1+...\beta_pX_p}}$$
p(X) = exp(b_0 + b_1*X_1 + ... + b_k X_k) / [1 + exp(b_0 + b_1*X_1 + ... + b_k X_k)]
the odds-ratio is: $$\frac{p(X)}{1-p(X)} = e^{\beta_0+\beta_1X_1+...\beta_pX_p}$$
p(X) / [1-p(X)] = exp(b_0 + b_1*X_1 + ... + b_k X_k)
The p(X)
does need further explanation. It is the probability of X. So when you have a probability of p(X)=0.75
(probability that the coin is heads), the odds is 0.75/(1-0.75
or 3. It is 3 times as likely to have a heads (p=0.75
) compare to tails (p=0.25
).
In your casep(X) = 1.7664/(1+1.7664)= 0.64
.
Upvotes: 1
Reputation: 84539
By default, family = "binomial"
uses the logit
link function (see ?family
). So the probability you're looking for is 1.7664 / (1+1.7664)
.
Upvotes: 2