Reputation: 99
I am trying to find VIF value for my Logistic Regression Model. I use vif() function, but I dont know the explanation of the calculation, since I dont know how to find R^2 for Logistic Regression. Does anyone know where the calculation came from?
Here is my data
d=read.delim("http://dnett.github.io/S510/Disease.txt")
d$disease=factor(d$disease)
d$ses=factor(d$ses)
d$sector=factor(d$sector)
d$savings=factor(d$savings)
#Base Level
d$sector=relevel(d$sector, ref="1")
d$ses=relevel(d$ses, ref="1")
d$savings=relevel(d$savings, ref="1")
My Logistic Regression
A=glm(disease~age+ses+sector+savings, family=binomial(link=logit), data=d)
summary(A)
VIF value:
car::vif(A)
Here is the output of VIF
GVIF Df GVIF^(1/(2*Df))
age 1.118357 1 1.057524
ses 1.263032 2 1.060116
sector 1.119910 1 1.058258
savings 1.346071 1 1.160203
Upvotes: 2
Views: 3806
Reputation: 18754
You can calculate McFadden's Pseudo R^2. This is the log-likelihood of your model divided by the outcome when using 1 as a predictor.
# McFadden's Pseudo R2
1 - logLik(A)/logLik(glm(disease~1, family=binomial(link=logit), data=d))
# 'log Lik.' 0.1062919 (df=6)
# R2 is .106 or 10.6% explained variance
However, it is probably better to look at the area under the curve (AUC) for logistic regression.
# extract the fitted values for visualization
fit.p <- predict(A, type = "response")
pROC::roc(d$disease ~ fit.p) %>% plot()
# Call:
# roc.formula(formula = d$disease ~ fit.p)
#
# Data: fit.p in 139 controls (d$disease 0) < 57 cases (d$disease 1).
# Area under the curve: 0.7275
# AUC ranges between .5 and 1; larger is better
# measures model performance
For the generalized variance inflation factors, values around 1 generally indicate that collinearity is not found.
#---------- check for multicollinearity --------------
# generalized variance inflation factor
car::vif(A) # values around 1 suggest no collinearity
# GVIF Df GVIF^(1/(2*Df))
# age 1.118357 1 1.057524
# ses 1.263032 2 1.060116
# sector 1.119910 1 1.058258
# savings 1.346071 1 1.160203
I'm not sure where I'm going wrong, but these values are all about .02 off...it will still give you an idea of how these numbers come to be, though.
# make each variable in to matrices
Xage = matrix(d$age)
# you have to drop one factor level for each factor
Xses = cbind(model.matrix(~d$ses + 0))[,-1]
Xsec = cbind(model.matrix(~d$sector + 0))[,-1] %>% as.matrix()
Xsav = cbind(model.matrix(~d$savings + 0))[,-1] %>% as.matrix()
gvif <- function(v1, v2, v3, v4){
results <- det(cor(v1)) * det(cor(cbind(v2, v3, v4))
)/ det(cor(cbind(v1, v2, v3, v4)))
return(results)
}
data.frame(Variables = c("age","ses","sector","savings"),
GVIF = c(gvif(Xage,Xses,Xsec,Xsav),
gvif(Xses,Xage,Xsec,Xsav),
gvif(Xsec,Xage,Xses,Xsav),
gvif(Xsav,Xage,Xses,Xsec)),
Df = c(ncol(Xage),ncol(Xses),ncol(Xsec),ncol(Xsav))) %>%
mutate("GVIF^1/2*Df" = GVIF^(1/(2*Df)))
# Variables GVIF Df GVIF^1/2*Df
# 1 age 1.099167 1 1.048412
# 2 ses 1.240770 2 1.055414
# 3 sector 1.101453 1 1.049502
# 4 savings 1.329336 1 1.152968
Upvotes: 1