Reputation: 65
I have a data set (example below) and I want to run a regression with multiple regressors. When I try to test for multicollinearity with vif(), I get the following error (translated from German):
Error in UseMethod("vcov") : no applicable method for 'vcov' applied to an object of class "data.frame"
In addition: Warning message: Unknown or uninitialised column: `coefficients`.
My code is the following:
subset_cor <- subset(Relevante_V03,
                     select = c(xsga, xrd, at, sale, intr, inflr,
                                sale_py, at_py, sale_py_at_py, value, dt,
                                re, txt, p_con, marketingspending, R_at_py))

library(tidyverse)
library(broom)

lm01 <- Relevante_V03 %>%
  mutate(sic = factor(sic)) %>%
  group_by(sic) %>%
  filter(n() >= 10) %>%
  group_split() %>%
  map_dfr(.f = function(df){
    lm(marketingspending ~ intr + sale_py_at_py + R_at_py, data = df) %>%
      glance() %>%
      add_column(sic = unique(df$sic), .before = 1)
  })

library(car)

# pairs(Relevante_V03, lower.panel = panel.smooth, cex = .8, pch = 21,
#       bg = "steelblue", diag.panel = panel.hist, cex.labels = 1.2,
#       font.labels = 2, upper.panel = panel.cor)

vif(lm01)
(It also does not let me run the pairs() part. Is there another option that gives me similar information?)
My data set looks something like this:
Name | Segment | Sale | Year | Asset | Another header |
---|---|---|---|---|---|
A | 3401 | 10000 | 2000 | 200000 | x |
A | 3401 | 20000 | 2001 | 250000 | x |
B | 2201 | 15000 | 2004 | 280000 | x |
B | 2201 | 23000 | 2009 | 320000 | x |
B | 2201 | 28000 | 2010 | 390000 | x |
C | 2201 | 30000 | 2000 | 210000 | x |
C | 2201 | 18000 | 2004 | 200000 | x |
D | 1 | 28000 | 2000 | 400000 | x |
D | 1 | 38000 | 2001 | 521000 | x |
How can I fix this problem?
I want to extend the lm() regression with more variables afterwards. Is there anything I have to consider for that? (For example, is a character variable okay?)
Edit:
Based on the help in the comments I changed my approach to the following code (errors are noted as comments right after the code where they appeared):
install.packages(c("funModeling","gvlma", "caret"))
install.packages("ggplot2")
install.packages("tidyverse")
install.packages("broom")
library(ggplot2)
library(funModeling)
library(gvlma)
library(caret)
library(tidyverse)
library(broom)
table_lm01 <- df_status(Relevante_V03) #ganz am Anfang sinnvoller?
#splitting data
mpg <- Relevante_V03 %>%
mutate(fdyear = factor(fdyear),
sic = factor(sic)) %>%
group_by(fdyear, sic) %>%
filter(n() >= 10) %>%
group_split()
# set the seed for repeatability
set.seed(59385)
# create training vector of indices for each group for training the model
tr <- mpg %>%
map(~ createDataPartition(.x$marketingspending,
list = F,
p = .7))
# now map over mpg groups and tr vectors
lm01 <- map(1:98, ~lm(marketingspending ~ intr + sale_py_at_py + R_at_py + dt
+ p_con + intr, data = mpg[[.x]][tr[[.x]],]))
resMod <- map_dfr(lm01, glance) %>%
add_column(ySic = map_chr(mpg, ~paste(unique(.x$fdyear),
unique(.x$sic),
sep = ".")),
.before = 1)
#testing lm
testLM <- map(1:98,
~predict(lm01[[.x]], mpg[[.x]][-tr[[.x]], ]))
#Error: There were 50 or more warnings (display the first 50 with warnings())
->50: In predict.lm(lm01[[.x]], mpg[[.x]][-tr[[.x]], ]) : prediction from a rank-deficient fit may be misleading
resPred <- map_dfr(1:98,
~postResample(testLM[[.x]],
mpg[[.x]][-tr[[.x]], ]$marketingspending))
resMod <- map_dfr(lm01, glance) %>%
add_column(yCyl = map_chr(mpg, ~paste(unique(.x$fdyear),
unique(.x$sic),
sep = ".")),
.before = 1)
View(resMod)
#tsSigma <- map(1:98, ~sqrt(sum((mpg[[.x]][-tr[[.x]], ]$marketingspending -
# testLM[[.x]])^2)/lm01[[.x]]$df.residual))
V#iew(tsSigma)
#testing assumption that regression is a "good" regression
map(lm01, summary)
cbind01 <- cbind(resMod[, 1:3], resPred) %>%
as.data.frame() %>%
mutate(fit = ifelse(r.squared > Rsquared,
"Check for overfit",
"Overfit unlikely"))
#cbind01$fit %>% summarise_all( ~ sum(is.na(x)))
#table(cbind01$fit)
install.packages(c("car"))
library(car)
vif.lm <- Relevante_V03 %>%
mutate(sic = factor(sic)) %>%
group_by(sic, fdyear) %>%
filter(n() >= 10) %>%
group_split() %>%
map_dfr(.f = function(df){
lm(marketingspending ~ intr + sale_py_at_py + R_at_py + txt + value + dt, data=Relevante_V03) %>%
vif()
})
The result of cbind01 looks like this (excerpt):
ySic | r.squared | adj.r.squared | RMSE | Rsquared | MAE | fit |
---|---|---|---|---|---|---|
2000.2832 | 0.3059 | 0.0535 | 19162165 | 0.6041 | 182... | Overfit unlikely |
2001.3674 | 0.4932 | 0.2035 | 3479144 | 0.9801 | 270... | Overfit unlikely |
2001.3826 | 0.4932 | 0.2035 | NA | NA | NA | NA |
2003.3954 | 0.9356 | 0.8718 | 7460919 | 0.0001 | 480... | Check for overfit |
Furthermore, the result of the cbind is a table with 98 rows; some rows contain NA (how is that possible?), some say "Check for overfit", and some "Overfit unlikely". How can I get the exact count of each of the three outcomes in that column? And when do I have to conclude that a model is overfitting and when not? Is there a percentage threshold or something else?
Thanks to @Kat for helping me get this far :)
Upvotes: 2
Views: 1403
Reputation: 18714
If you're happy with the structure in which you created your lm()
output, just do the same for the vif()
output. You can't use the model when there is no model: lm01 only contains the glance()
metrics, not the fitted models, and metrics won't work for finding the VIF. Copy the code, but call vif()
instead of glance()
.
vif.lm <- Relevante_V03 %>%
  mutate(sic = factor(sic)) %>%
  group_by(sic) %>%
  filter(n() >= 10) %>%
  group_split() %>%
  map_dfr(.f = function(df){
    lm(marketingspending ~ intr + sale_py_at_py + R_at_py, data = df) %>%
      vif()
  })
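If you also want to keep track of which sic group each row of VIF values belongs to, a minimal sketch along the same lines (same data and variable names as above) can carry the group label the same way you did with glance():
# sketch: one row of VIFs per sic group, labelled with the group
vif.lm <- Relevante_V03 %>%
  mutate(sic = factor(sic)) %>%
  group_by(sic) %>%
  filter(n() >= 10) %>%
  group_split() %>%
  map_dfr(.f = function(df){
    lm(marketingspending ~ intr + sale_py_at_py + R_at_py, data = df) %>%
      vif() %>%            # named numeric vector of VIF values
      t() %>%              # 1-row matrix; names become column names
      as.data.frame() %>%  # so map_dfr() can row-bind the groups
      add_column(sic = unique(df$sic), .before = 1)
  })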
Based on your comment, I thought I would provide more information regarding the assumptions and testing the model.
For this, I grabbed some readily available data in R. I used mpg
from the ggplot2
package.
library(funModeling) # just for df_status
# ran before tidyverse, so dplyr has precedence
library(gvlma) # used for assumptions of MLR
library(caret) # used for partitioning and prediction metrics
library(tidyverse) # from your original code
library(broom) # from your original code
# inspecting the data
str(mpg)
df_status(mpg)
Next, I split the data into groups, as you did, but without running the model.
# splitting the data
mpg <- mpg %>%
  mutate(year = factor(year),
         cyl = factor(cyl)) %>%
  group_by(year, cyl) %>%
  filter(n() >= 10) %>%
  group_split()
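Because the later map() calls index the groups by position, it is worth a quick check of how many groups group_split() actually produced before hardcoding that number:
# number of year/cyl groups after filtering
length(mpg)        # 6 for this data
# seq_along(mpg) avoids hardcoding 1:6 in the map() calls below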
Now I will partition the data so that the models can first be trained then tested. I will also use the model to assess the assumptions of this type of regression.
# set the seed for repeatability
set.seed(59385)
# create training vector of indices for each group for training the model
tr <- mpg %>%
  map(~ createDataPartition(.x$displ,
                            list = F,
                            p = .7))
The models will be trained using the vectors of indices in tr
(one for each group).
# now map over mpg groups and tr vectors
lm01 <- map(1:6, ~lm(displ ~ cty + hwy, data = mpg[[.x]][tr[[.x]],]))
The tidy setup for the results can still be obtained, just like your original work.
resMod <- map_dfr(lm01, glance) %>%
  add_column(yCyl = map_chr(mpg, ~paste(unique(.x$year),
                                        unique(.x$cyl),
                                        sep = ".")),
             .before = 1)
There are a lot of ways to test the assumptions. One of the more thorough methods I have found is from the package gvlma
.
map(lm01, gvlma)
For this data, only the first model did not meet the assumptions:
# [[1]]
#
# Call:
# lm(formula = displ ~ cty + hwy, data = mpg[[.x]][tr[[.x]], ])
#
# Coefficients:
# (Intercept) cty hwy
# 3.35394 0.02865 -0.06536
#
#
# ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
# USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
# Level of Significance = 0.05
#
# Call:
# .f(x = .x[[i]])
#
# Value p-value Decision
# Global Stat 13.80206 0.0079543 Assumptions NOT satisfied!
# Skewness 0.07551 0.7834808 Assumptions acceptable.
# Kurtosis 1.46654 0.2258932 Assumptions acceptable.
# Link Function 11.77835 0.0005992 Assumptions NOT satisfied!
# Heteroscedasticity 0.48167 0.4876676 Assumptions acceptable.
You can test the models by first predicting with the test data, then getting the metrics for the predictions versus actuals with postResample
from the caret
package.
# now for testing the models
testLM <- map(1:6,
              ~predict(lm01[[.x]], mpg[[.x]][-tr[[.x]], ]))
# find the results
resPred <- map_dfr(1:6,
                   ~postResample(testLM[[.x]],
                                 mpg[[.x]][-tr[[.x]], ]$displ))
# # A tibble: 6 × 3
# RMSE Rsquared MAE
# <dbl> <dbl> <dbl>
# 1 0.282 0.211 0.256
# 2 0.430 0.235 0.374
# 3 0.836 0.175 0.635
# 4 0.210 0.497 0.186
# 5 0.293 0.266 0.265
# 6 0.474 0.163 0.434
To make it easier to recognize if there is a problem, I compared the training and testing like this:
cbind(resMod[, 1:3], resPred) %>%
  as.data.frame() %>%
  mutate(fit = ifelse(r.squared > Rsquared,
                      "Check for overfit",
                      "Overfit unlikely"))
# yCyl r.squared adj.r.squared RMSE Rsquared MAE fit
# 1 1999.4 0.4506805 0.41405924 0.2821392 0.2106108 0.2556183 Check for overfit
# 2 1999.6 0.5367920 0.50591147 0.4299883 0.2345990 0.3738415 Check for overfit
# 3 1999.8 0.1447143 0.04409243 0.8357696 0.1750658 0.6354507 Overfit unlikely
# 4 2008.4 0.2857471 0.22363818 0.2095897 0.4969283 0.1861229 Overfit unlikely
# 5 2008.6 0.1041972 0.02630127 0.2929450 0.2656493 0.2645665 Overfit unlikely
# 6 2008.8 0.1266099 0.06422494 0.4743184 0.1628117 0.4343889 Overfit unlikely
In this example, the first two models are questionable, at best. The first model did not meet the assumptions; that is probably why there is so much volatility between training and testing. The second model must be interesting...
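If you want to dig into a model that was flagged (the first one here), the standard base R diagnostic plots are a quick follow-up, for example:
# residuals-vs-fitted, Q-Q, scale-location, and leverage plots for model 1
par(mfrow = c(2, 2))
plot(lm01[[1]])
par(mfrow = c(1, 1))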
Upvotes: 1