New_to_r

Reputation: 65

Testing Multicollinearity with R (vif)

I have a data set (example below) and I want to run a regression with multiple regressors. When I try to test for multicollinearity with vif(), I get the following error:

Error in UseMethod("vcov") : no applicable method for 'vcov' applied to an object of class "data.frame"
In addition: Warning message: Unknown or uninitialised column: `coefficients`.

(The error message is translated from German, so the wording may differ slightly from the original English output.)

My code is the following:

subset_cor <- subset(Relevante_V03,
                     select = c(xsga, xrd, at, sale, intr, inflr,
                                sale_py, at_py, sale_py_at_py, value, dt,
                                re, txt, p_con, marketingspending, R_at_py))
    
library(tidyverse)
library(broom)
lm01 <- Relevante_V03 %>% 
  mutate(sic = factor(sic)) %>% 
  group_by(sic) %>% 
  filter(n() >= 10) %>% 
  group_split() %>% 
  map_dfr(.f = function(df){
    lm(marketingspending ~ intr + sale_py_at_py + R_at_py, data=df) %>% 
      glance() %>% 
      add_column(sic = unique(df$sic), .before = 1)
  }) 

library(car)
# pairs(Relevante_V03, lower.panel = panel.smooth, cex = .8, pch = 21,
#       bg = "steelblue", diag.panel = panel.hist, cex.labels = 1.2, font.labels = 2,
#       upper.panel = panel.cor)
# note: panel.hist and panel.cor are not base R functions; they are defined in the
# examples under ?pairs and have to be created first
vif(lm01)

(Also, the pairs() call above does not run for me. Is there another option that gives me the same kind of information?)
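A rough sketch of an alternative, assuming all columns selected into subset_cor above are numeric, would be a plain correlation matrix:

# alternative to pairs(): a plain correlation matrix of the selected columns,
# assuming they are all numeric; missing values are handled pairwise
cor_mat <- cor(subset_cor, use = "pairwise.complete.obs")
round(cor_mat, 2)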

My data set looks something like this:

Name  Segment  Sale   Year  Asset   Another header
A     3401     10000  2000  200000  x
A     3401     20000  2001  250000  x
B     2201     15000  2004  280000  x
B     2201     23000  2009  320000  x
B     2201     28000  2010  390000  x
C     2201     30000  2000  210000  x
C     2201     18000  2004  200000  x
D     1        28000  2000  400000  x
D     1        38000  2001  521000  x
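For reference, here is a rough copy-pasteable sketch of that structure as a data frame (the column names, including Other for the "Another header" column, are just placeholders taken from the table above; tribble() comes with the tidyverse):

library(tibble)

# minimal reproducible sketch of the example data shown above
example_data <- tribble(
  ~Name, ~Segment,  ~Sale, ~Year,  ~Asset, ~Other,
  "A",       3401,  10000,  2000,  200000, "x",
  "A",       3401,  20000,  2001,  250000, "x",
  "B",       2201,  15000,  2004,  280000, "x",
  "B",       2201,  23000,  2009,  320000, "x",
  "B",       2201,  28000,  2010,  390000, "x",
  "C",       2201,  30000,  2000,  210000, "x",
  "C",       2201,  18000,  2004,  200000, "x",
  "D",          1,  28000,  2000,  400000, "x",
  "D",          1,  38000,  2001,  521000, "x"
)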

How can I fix this problem?

I want to extend the lm() regression afterwards with more variables. Do I have to consider anything for that (for example, is a character variable okay)?
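To illustrate what I mean by a character variable, here is a minimal toy sketch (placeholder names, not my real data); from what I can tell, lm() seems to treat a character predictor like an unordered factor:

set.seed(1)
# toy data with placeholder names: grp is a character column used as a predictor
toy <- data.frame(y   = rnorm(9),
                  x   = rnorm(9),
                  grp = rep(c("a", "b", "c"), each = 3))

# lm() converts grp to a factor internally and creates dummy variables
coef(lm(y ~ x + grp, data = toy))

# doing the conversion explicitly gives the same coefficients but is more transparent
toy$grp <- factor(toy$grp)
coef(lm(y ~ x + grp, data = toy))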

Edit:

Based on the help in the comments, I changed my approach to the following code (errors are noted right after the code where they appeared):

install.packages(c("funModeling","gvlma", "caret"))
install.packages("ggplot2")
install.packages("tidyverse")
install.packages("broom")
library(ggplot2)
library(funModeling)
library(gvlma)
library(caret)
library(tidyverse)
library(broom)

table_lm01 <- df_status(Relevante_V03) # would this make more sense at the very beginning?

#splitting data
mpg <- Relevante_V03 %>% 
  mutate(fdyear = factor(fdyear),
         sic = factor(sic)) %>% 
  group_by(fdyear, sic) %>% 
  filter(n() >= 10) %>% 
  group_split()

# set the seed for repeatability
set.seed(59385)


# create training vector of indices for each group for training the model
tr <- mpg %>% 
  map(~ createDataPartition(.x$marketingspending, 
                            list = F, 
                            p = .7))


# now map over mpg groups and tr vectors
lm01 <- map(1:98, ~lm(marketingspending ~ intr + sale_py_at_py + R_at_py + dt
                      + p_con, data = mpg[[.x]][tr[[.x]], ]))


resMod <- map_dfr(lm01, glance) %>% 
  add_column(ySic = map_chr(mpg, ~paste(unique(.x$fdyear), 
                                        unique(.x$sic), 
                                        sep = ".")),
             .before = 1)


#testing lm 
testLM <- map(1:98,
              ~predict(lm01[[.x]], mpg[[.x]][-tr[[.x]], ]))

This gives the warning: There were 50 or more warnings (use warnings() to see the first 50)

-> 50: In predict.lm(lm01[[.x]], mpg[[.x]][-tr[[.x]], ]) : prediction from a rank-deficient fit may be misleading
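To narrow this down, here is a rough sketch of how I could check which group models are rank-deficient; as far as I know, the warning means some coefficients could not be estimated and come back as NA:

# rough sketch: count NA (aliased) coefficients per group model;
# groups with a non-zero count are the rank-deficient fits the warning refers to
na_coefs <- map_int(lm01, ~ sum(is.na(coef(.x))))
which(na_coefs > 0)

# for one such group (assuming at least one exists), see which terms were dropped
coef(lm01[[which(na_coefs > 0)[1]]])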

resPred <- map_dfr(1:98,
                   ~postResample(testLM[[.x]], 
                                 mpg[[.x]][-tr[[.x]], ]$marketingspending))

resMod <- map_dfr(lm01, glance) %>% 
  add_column(yCyl = map_chr(mpg, ~paste(unique(.x$fdyear), 
                                        unique(.x$sic), 
                                        sep = ".")),
             .before = 1)

View(resMod)

#tsSigma <- map(1:98, ~sqrt(sum((mpg[[.x]][-tr[[.x]], ]$marketingspending - 
#                                   testLM[[.x]])^2)/lm01[[.x]]$df.residual))

#View(tsSigma)

#testing assumption that regression is a "good" regression

map(lm01, summary)


cbind01 <- cbind(resMod[, 1:3], resPred) %>% 
  as.data.frame() %>% 
  mutate(fit = ifelse(r.squared > Rsquared, 
                      "Check for overfit",
                      "Overfit unlikely"))

#cbind01$fit %>% summarise_all( ~ sum(is.na(x)))
#table(cbind01$fit)

install.packages(c("car"))
library(car)
vif.lm <- Relevante_V03 %>% 
  mutate(sic = factor(sic)) %>% 
  group_by(sic, fdyear) %>% 
  filter(n() >= 10) %>% 
  group_split() %>% 
  map_dfr(.f = function(df){
    lm(marketingspending ~ intr + sale_py_at_py + R_at_py + txt + value + dt, data=Relevante_V03) %>% 
      vif()
  })
   
The result of the cbind looks like this (values truncated):

ySic       r.squared  adj.r.squared  RMSE      Rsquared  MAE     fit
2000.2832  0.3059     0.0535         19162165  0.6041    182...  O u
2001.3674  0.4932     0.2035         3479144   0.9801    270...  O u
2001.3826  0.4932     0.2035         NA        NA        NA      NA
2003.3954  0.9356     0.8718         7460919   0.0001    480...  Cfo

(O u = Overfit unlikely, Cfo = Check for overfit)

Furthermore, the result of the cbind is a table with 98 rows; some rows have the result NA (how is that possible?), some say "Check for overfit", and some "Overfit unlikely". How can I get the exact count of each of these three options in that column? And when do I have to conclude that a model is overfitting? Is there a percentage threshold or something else?
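For the counting part, here is a rough sketch based on the commented-out table() line above, extended so that NA is counted as its own category:

# count how often each value of fit occurs, including NA as its own category
table(cbind01$fit, useNA = "ifany")

# the same with dplyr
cbind01 %>% count(fit)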

Thanks to @Kat for helping me getting this far :)

Upvotes: 2

Views: 1403

Answers (1)

Kat

Reputation: 18714

If you're happy with the structure in which you created your lm() output, just do the same with the vif() output. You can't use vif() on the model when there is no model: your lm01 output is just metrics, and metrics won't work for finding the VIF. Copy the code, but instead of glance(), use vif().

vif.lm <- Relevante_V03 %>% 
  mutate(sic = factor(sic)) %>% 
  group_by(sic) %>% 
  filter(n() >= 10) %>% 
  group_split() %>% 
  map_dfr(.f = function(df){
    lm(marketingspending ~ intr + sale_py_at_py + R_at_py, data=df) %>% 
      vif()
  }) 
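If it helps, here is a rough sketch of how that result could then be screened against a common rule of thumb (a VIF above about 5, or 10 depending on who you ask, is often taken as a sign of problematic multicollinearity; the cutoff below is only an assumption, pick your own):

# flag predictors whose VIF exceeds a chosen threshold in each group;
# rows of vif.lm correspond to the sic groups in the order they were split
vif_threshold <- 5   # assumed cutoff; 10 is another commonly used value
vif.lm %>% 
  mutate(across(everything(), ~ .x > vif_threshold))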


Based on your comment, I thought I would provide more information regarding the assumptions and testing the model.

For this, I grabbed some readily available data in R. I used mpg from the ggplot2 package.

library(funModeling) # just for df_status
                     # ran before tidyverse, so dplyr has precedence
library(gvlma)       # used for assumptions of MLR
library(caret)       # used for partitioning and prediction metrics
library(tidyverse)   # from your original code
library(broom)       # from your original code

# inspecting the data
str(mpg)
df_status(mpg)

Next, I split the data into groups, as you did, but without running the model.

# splitting the data
mpg <- mpg %>% 
  mutate(year = factor(year),
         cyl = factor(cyl)) %>% 
  group_by(year, cyl) %>% 
  filter(n() >= 10) %>% 
  group_split()

Now I will partition the data so that the models can first be trained then tested. I will also use the model to assess the assumptions of this type of regression.

# set the seed for repeatability
set.seed(59385)
# create training vector of indices for each group for training the model
tr <- mpg %>% 
  map(~ createDataPartition(.x$displ, 
                            list = F, 
                            p = .7))

The models will be trained using the vectors of indices in tr (one for each group).

# now map over mpg groups and tr vectors
lm01 <- map(1:6, ~lm(displ ~ cty + hwy, data = mpg[[.x]][tr[[.x]],]))

The tidy setup for the results can still be obtained, just like your original work.

resMod <- map_dfr(lm01, glance) %>% 
  add_column(yCyl = map_chr(mpg, ~paste(unique(.x$year), 
                                        unique(.x$cyl), 
                                        sep = ".")),
             .before = 1)

There are a lot of ways to test the assumptions. One of the more thorough methods I have found is from the package gvlma.

map(lm01, gvlma)

For this data, only the first model did not meet the assumptions:

# [[1]]
# 
# Call:
# lm(formula = displ ~ cty + hwy, data = mpg[[.x]][tr[[.x]], ])
# 
# Coefficients:
# (Intercept)          cty          hwy  
#     3.35394      0.02865     -0.06536  
# 
# 
# ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
# USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
# Level of Significance =  0.05 
# 
# Call:
#  .f(x = .x[[i]]) 
# 
#                       Value   p-value                   Decision
# Global Stat        13.80206 0.0079543 Assumptions NOT satisfied!
# Skewness            0.07551 0.7834808    Assumptions acceptable.
# Kurtosis            1.46654 0.2258932    Assumptions acceptable.
# Link Function      11.77835 0.0005992 Assumptions NOT satisfied!
# Heteroscedasticity  0.48167 0.4876676    Assumptions acceptable.

You can test the models by first predicting with the test data, then getting the metrics for the predictions versus actuals with postResample from the caret package.

# now for testing the models
testLM <- map(1:6,
              ~predict(lm01[[.x]], mpg[[.x]][-tr[[.x]], ]))

# find the results
resPred <- map_dfr(1:6,
               ~postResample(testLM[[.x]], 
                             mpg[[.x]][-tr[[.x]], ]$displ))
# # A tibble: 6 × 3
#    RMSE Rsquared   MAE
#   <dbl>    <dbl> <dbl>
# 1 0.282    0.211 0.256
# 2 0.430    0.235 0.374
# 3 0.836    0.175 0.635
# 4 0.210    0.497 0.186
# 5 0.293    0.266 0.265
# 6 0.474    0.163 0.434 

To make it easier to recognize if there is a problem, I compared the training and testing like this:

cbind(resMod[, 1:3], resPred) %>% 
  as.data.frame() %>% 
  mutate(fit = ifelse(r.squared > Rsquared, 
                      "Check for overfit",
                      "Overfit unlikely"))
#     yCyl r.squared adj.r.squared      RMSE  Rsquared       MAE               fit
# 1 1999.4 0.4506805    0.41405924 0.2821392 0.2106108 0.2556183 Check for overfit
# 2 1999.6 0.5367920    0.50591147 0.4299883 0.2345990 0.3738415 Check for overfit
# 3 1999.8 0.1447143    0.04409243 0.8357696 0.1750658 0.6354507  Overfit unlikely
# 4 2008.4 0.2857471    0.22363818 0.2095897 0.4969283 0.1861229  Overfit unlikely
# 5 2008.6 0.1041972    0.02630127 0.2929450 0.2656493 0.2645665  Overfit unlikely
# 6 2008.8 0.1266099    0.06422494 0.4743184 0.1628117 0.4343889  Overfit unlikely 

In this example, the first two models are questionable, at best. The first model did not meet the assumptions; that is probably why there is so much volatility between training and testing. The second model must be interesting...

Upvotes: 1
