Reputation: 99

Explaination of Calculating VIF() GLM in R

I am trying to find VIF value for my Logistic Regression Model. I use vif() function, but I dont know the explanation of the calculation, since I dont know how to find R^2 for Logistic Regression. Does anyone know where the calculation came from?

Here is my data

d=read.delim("http://dnett.github.io/S510/Disease.txt")

d$disease=factor(d$disease)
d$ses=factor(d$ses)
d$sector=factor(d$sector)
d$savings=factor(d$savings)

#Base Level
    d$sector=relevel(d$sector, ref="1")
    d$ses=relevel(d$ses, ref="1")
    d$savings=relevel(d$savings, ref="1")

My Logistic Regression

A=glm(disease~age+ses+sector+savings, family=binomial(link=logit), data=d)
summary(A)

VIF value:

car::vif(A)

Here is the output of VIF

            GVIF Df GVIF^(1/(2*Df))
age     1.118357  1        1.057524
ses     1.263032  2        1.060116
sector  1.119910  1        1.058258
savings 1.346071  1        1.160203

Upvotes: 2

Answers (1)

Kat

Reputation: 18754

R^2 for logistic regression as a performance measure

You can calculate McFadden's Pseudo R^2. This is the log-likelihood of your model divided by the outcome when using 1 as a predictor.

# McFadden's Pseudo R2
1 - logLik(A)/logLik(glm(disease~1, family=binomial(link=logit), data=d))
# 'log Lik.' 0.1062919 (df=6) 

# R2 is .106 or 10.6% explained variance

AUC for logistic regression as a performance measure

However, it is probably better to look at the area under the curve (AUC) for logistic regression.

# extract the fitted values for visualization
fit.p <- predict(A, type = "response")

pROC::roc(d$disease ~ fit.p) %>% plot()
# Call:
#   roc.formula(formula = d$disease ~ fit.p)
# 
# Data: fit.p in 139 controls (d$disease 0) < 57 cases (d$disease 1).
# Area under the curve: 0.7275

# AUC ranges between .5 and 1; larger is better
# measures model performance

Test for Multicollinearity

For the generalized variance inflation factors, values around 1 generally indicate that collinearity is not found.

#---------- check for multicollinearity --------------
# generalized variance inflation factor
car::vif(A) # values around 1 suggest no collinearity
#             GVIF Df GVIF^(1/(2*Df))
# age     1.118357  1        1.057524
# ses     1.263032  2        1.060116
# sector  1.119910  1        1.058258
# savings 1.346071  1        1.160203

Updated: how GVIF/VIF is calculated

I'm not sure where I'm going wrong, but these values are all about .02 off...it will still give you an idea of how these numbers come to be, though.

# make each variable in to matrices
Xage = matrix(d$age)

# you have to drop one factor level for each factor
Xses = cbind(model.matrix(~d$ses + 0))[,-1]
Xsec = cbind(model.matrix(~d$sector + 0))[,-1] %>% as.matrix()
Xsav = cbind(model.matrix(~d$savings + 0))[,-1] %>% as.matrix()

gvif <- function(v1, v2, v3, v4){
  results <- det(cor(v1)) * det(cor(cbind(v2, v3, v4))
              )/ det(cor(cbind(v1, v2, v3, v4)))
  return(results)
}

data.frame(Variables = c("age","ses","sector","savings"),
           GVIF = c(gvif(Xage,Xses,Xsec,Xsav),
                    gvif(Xses,Xage,Xsec,Xsav),
                    gvif(Xsec,Xage,Xses,Xsav),
                    gvif(Xsav,Xage,Xses,Xsec)),
           Df = c(ncol(Xage),ncol(Xses),ncol(Xsec),ncol(Xsav))) %>% 
  mutate("GVIF^1/2*Df" = GVIF^(1/(2*Df)))

#   Variables     GVIF Df GVIF^1/2*Df
# 1       age 1.099167  1    1.048412
# 2       ses 1.240770  2    1.055414
# 3    sector 1.101453  1    1.049502
# 4   savings 1.329336  1    1.152968

Upvotes: 1