Reputation: 1841
Trying to understand the use of logistic regression. I have the following data:
Gender  Age    No.transcation  Transaction
female  18-24          138485         4047
male    18-24          144301         3766
female  25-34          248362         7559
male    25-34          295800         8126
female  35-44          265514         7171
male    35-44          379872         9047
female  45-54          295002         7072
male    45-54          421432         9648
female  55-64          382198         7529
male    55-64          456308         9016
female  65+            352501         4856
male    65+            465253         6889
Running logistic regression in R, I get the following summary output:
> mod2 <- glm(cbind(Transaction, No.transcation) ~ Gender + Age, data = csvd,
family = binomial())
> summary(mod2)
Call:
glm(formula = cbind(Transaction, No.transcation) ~ Gender + Age,
family = binomial(), data = csvd)
Deviance Residuals:
      1        2        3        4        5        6
 1.8732  -1.9018   2.2654  -2.1473   3.4810  -3.0228
      7        8        9       10       11       12
-0.2772   0.2377  -2.5500   2.3717  -4.9638   4.3408
Coefficients:
             Estimate Std. Error  z value Pr(>|z|)
(Intercept) -3.562800   0.011984 -297.290  < 2e-16 ***
Gendermale  -0.051852   0.006993   -7.415 1.22e-13 ***
Age25-34     0.044091   0.014042    3.140  0.00169 **
Age35-44    -0.090757   0.013966   -6.499 8.11e-11 ***
Age45-54    -0.164705   0.013894  -11.855  < 2e-16 ***
Age55-64    -0.334841   0.013900  -24.088  < 2e-16 ***
Age65+      -0.651142   0.014767  -44.094  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4490.792 on 11 degrees of freedom
Residual deviance: 93.866 on 5 degrees of freedom
AIC: 235.5
Number of Fisher Scoring iterations: 3
Exponentiating the coefficients to get the odds ratios, I find they are almost identical to the simple ratios of the share of users with transactions:
> exp(summary(mod2)$coefficients)
              Estimate Std. Error       z value Pr(>|z|)
(Intercept) 0.02835931   1.012056 7.735499e-130 1.000000
Gendermale  0.94946976   1.007018  6.022806e-04 1.000000
Age25-34    1.04507762   1.014141  2.310243e+01 1.001691
Age35-44    0.91323954   1.014064  1.505641e-03 1.000000
Age45-54    0.84814413   1.013991  7.106341e-06 1.000000
Age55-64    0.71545181   1.013998  3.455562e-11 1.000000
Age65+      0.52145005   1.014877  7.084264e-20 1.000000
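Only the Estimate column of that table is meaningful after exponentiation; the standard errors, z values and p-values do not become anything useful on the exp scale. A minimal sketch of one way to get just the odds ratios with intervals, assuming the data frame is rebuilt by hand from the table above (the names `or` and `ci` are illustrative, and the intervals are plain Wald intervals rather than anything the original post computed):

```r
# Data copied from the table in the question (column names kept as-is).
csvd <- data.frame(
  Gender = rep(c("female", "male"), times = 6),
  Age = rep(c("18-24", "25-34", "35-44", "45-54", "55-64", "65+"), each = 2),
  No.transcation = c(138485, 144301, 248362, 295800, 265514, 379872,
                     295002, 421432, 382198, 456308, 352501, 465253),
  Transaction = c(4047, 3766, 7559, 8126, 7171, 9047,
                  7072, 9648, 7529, 9016, 4856, 6889)
)

mod2 <- glm(cbind(Transaction, No.transcation) ~ Gender + Age,
            data = csvd, family = binomial())

# Odds ratios: exponentiate the coefficient estimates only.
or <- exp(coef(mod2))

# Wald 95% intervals computed on the log-odds scale, then exponentiated.
ci <- exp(confint.default(mod2))

round(cbind(OR = or, ci), 4)
```

The `OR` column should match the Estimate column of the exponentiated table above, and the interval columns give a sense of how precisely each ratio is estimated.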
Comparing the odds ratios to the share of users with transactions out of total users in each group, expressed relative to the baseline groups (female and 18-24), I get pretty much the same numbers:
female  (baseline)
male    94.68%
18-24   (baseline)
25-34   104.21%
35-44   91.17%
45-54   84.82%
55-64   71.97%
65+     52.66%
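For reference, these percentages can be reproduced by pooling the counts over the other variable and dividing each group's transaction share by the baseline group's share. The sketch below assumes that is how they were computed; the helper `rate` and the data frame construction are illustrative, not taken from the original post:

```r
# Data copied from the table in the question (column names kept as-is).
csvd <- data.frame(
  Gender = rep(c("female", "male"), times = 6),
  Age = rep(c("18-24", "25-34", "35-44", "45-54", "55-64", "65+"), each = 2),
  No.transcation = c(138485, 144301, 248362, 295800, 265514, 379872,
                     295002, 421432, 382198, 456308, 352501, 465253),
  Transaction = c(4047, 3766, 7559, 8126, 7171, 9047,
                  7072, 9648, 7529, 9016, 4856, 6889)
)

# Share of users with a transaction per group, pooled over the other variable.
rate <- function(group) {
  trans <- tapply(csvd$Transaction, group, sum)
  total <- tapply(csvd$Transaction + csvd$No.transcation, group, sum)
  trans / total
}

gender_rate <- rate(csvd$Gender)
age_rate <- rate(csvd$Age)

# Ratios of proportions relative to the baseline groups (female, 18-24).
gender_ratio <- gender_rate / gender_rate[["female"]]
age_ratio <- age_rate / age_rate[["18-24"]]

round(100 * gender_ratio, 2)
round(100 * age_ratio, 2)
```

These are marginal ratios of proportions, whereas the model coefficients are odds ratios adjusted for the other variable, so the two sets of numbers are close but not identical.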
So what's even the point of running logistic regression here? This dataset only has 2 features, but it could just as well be extended to 50. What does LR offer over just looking at the ratio for each group in this case? Is it because all the variables are nominal that it doesn't add much?
Upvotes: 2
Views: 1641
Reputation: 1294
You would hope that the estimated odds ratios are close to the realised proportions like this. You are estimating the probability Pr(Y=1|X=x): the probability of a transaction given age range and gender. With categorical predictors like this, an intuitive estimator is simply the proportion of outcomes observed in each group. And because transactions are rare here (roughly 1-3% of users in every group), odds and probabilities nearly coincide, so the odds ratios come out almost the same as the ratios of proportions. Logistic regression becomes more interesting when the predictors are continuous variables and you'd like to predict the probability of an outcome for some value of the predictor that you haven't observed. In these cases LR lets you map an unbounded linear function of your predictors onto a probability, which must by definition be bounded between 0 and 1.
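To illustrate the continuous case, here is a small sketch on simulated data (not the data from the question; the coefficients -2 and -0.02, the sample size and the variable names are all made up for the example). It fits a logistic regression on a continuous age and predicts at ages that need not appear in the sample, one of them outside the simulated range:

```r
set.seed(1)

# Simulated data: continuous age, binary outcome whose log-odds
# decrease linearly with age (assumed coefficients, for illustration only).
n <- 5000
age <- runif(n, 18, 80)
p_true <- plogis(-2 - 0.02 * age)
y <- rbinom(n, 1, p_true)

fit <- glm(y ~ age, family = binomial())

# Ages to predict for; 90 lies outside the simulated range of 18 to 80.
new_ages <- data.frame(age = c(30.5, 47.25, 90))

# Linear predictor on the unbounded log-odds scale.
eta <- predict(fit, newdata = new_ages, type = "link")

# The same predictions mapped onto probabilities in (0, 1).
p_hat <- predict(fit, newdata = new_ages, type = "response")

cbind(new_ages, eta, p_hat)
```

The `eta` column can take any real value, while `p_hat` is the inverse logit `plogis(eta)` and so always stays strictly between 0 and 1. A table of group proportions has no equivalent for an age it has never seen.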
Upvotes: 2