Reputation: 65
I have a data set (example below) and I want to run a regression with multiple regressors. When I try to test for multicollinearity with vif(), I get the following error (translated from German):
Error in UseMethod("vcov") : no applicable method for 'vcov' applied to an object of class "data.frame"
In addition: Warning message: Unknown or uninitialised column: `coefficients`.
My code is the following:
subset_cor <- subset(Relevante_V03,
                     select = c(xsga, xrd, at, sale, intr, inflr,
                                sale_py, at_py, sale_py_at_py, value, dt,
                                re, txt, p_con, marketingspending, R_at_py))

library(tidyverse)
library(broom)

lm01 <- Relevante_V03 %>%
  mutate(sic = factor(sic)) %>%
  group_by(sic) %>%
  filter(n() >= 10) %>%
  group_split() %>%
  map_dfr(.f = function(df){
    lm(marketingspending ~ intr + sale_py_at_py + R_at_py, data = df) %>%
      glance() %>%
      add_column(sic = unique(df$sic), .before = 1)
  })

library(car)

# pairs(Relevante_V03, lower.panel = panel.smooth, cex = .8, pch = 21,
#       bg = "steelblue", diag.panel = panel.hist, cex.labels = 1.2,
#       font.labels = 2, upper.panel = panel.cor)

vif(lm01)
(It also does not let me run the pairs() part. Is there another option that gives me similar information?)
My data set looks something like this:
Name | Segment | Sale | Year | Asset | Another header |
---|---|---|---|---|---|
A | 3401 | 10000 | 2000 | 200000 | x |
A | 3401 | 20000 | 2001 | 250000 | x |
B | 2201 | 15000 | 2004 | 280000 | x |
B | 2201 | 23000 | 2009 | 320000 | x |
B | 2201 | 28000 | 2010 | 390000 | x |
C | 2201 | 30000 | 2000 | 210000 | x |
C | 2201 | 18000 | 2004 | 200000 | x |
D | 1 | 28000 | 2000 | 400000 | x |
D | 1 | 38000 | 2001 | 521000 | x |
How can I fix this problem?
I want to extend the lm() regression with more variables afterwards. Is there anything I have to consider for that? (For example, is a character variable okay?)
Edit:
Based on the help in the comments I changed my approach to the following code (errors are noted as comments right after the code where they appeared):
install.packages(c("funModeling","gvlma", "caret"))
install.packages("ggplot2")
install.packages("tidyverse")
install.packages("broom")
library(ggplot2)
library(funModeling)
library(gvlma)
library(caret)
library(tidyverse)
library(broom)
table_lm01 <- df_status(Relevante_V03) #ganz am Anfang sinnvoller?
#splitting data
mpg <- Relevante_V03 %>%
mutate(fdyear = factor(fdyear),
sic = factor(sic)) %>%
group_by(fdyear, sic) %>%
filter(n() >= 10) %>%
group_split()
# set the seed for repeatability
set.seed(59385)
# create training vector of indices for each group for training the model
tr <- mpg %>%
map(~ createDataPartition(.x$marketingspending,
list = F,
p = .7))
# now map over mpg groups and tr vectors
lm01 <- map(1:98, ~lm(marketingspending ~ intr + sale_py_at_py + R_at_py + dt
+ p_con + intr, data = mpg[[.x]][tr[[.x]],]))
resMod <- map_dfr(lm01, glance) %>%
add_column(ySic = map_chr(mpg, ~paste(unique(.x$fdyear),
unique(.x$sic),
sep = ".")),
.before = 1)
#testing lm
testLM <- map(1:98,
~predict(lm01[[.x]], mpg[[.x]][-tr[[.x]], ]))
#Error: There were 50 or more warnings (display the first 50 with warnings())
->50: In predict.lm(lm01[[.x]], mpg[[.x]][-tr[[.x]], ]) : prediction from a rank-deficient fit may be misleading
resPred <- map_dfr(1:98,
~postResample(testLM[[.x]],
mpg[[.x]][-tr[[.x]], ]$marketingspending))
resMod <- map_dfr(lm01, glance) %>%
add_column(yCyl = map_chr(mpg, ~paste(unique(.x$fdyear),
unique(.x$sic),
sep = ".")),
.before = 1)
View(resMod)
#tsSigma <- map(1:98, ~sqrt(sum((mpg[[.x]][-tr[[.x]], ]$marketingspending -
# testLM[[.x]])^2)/lm01[[.x]]$df.residual))
V#iew(tsSigma)
#testing assumption that regression is a "good" regression
map(lm01, summary)
cbind01 <- cbind(resMod[, 1:3], resPred) %>%
as.data.frame() %>%
mutate(fit = ifelse(r.squared > Rsquared,
"Check for overfit",
"Overfit unlikely"))
#cbind01$fit %>% summarise_all( ~ sum(is.na(x)))
#table(cbind01$fit)
install.packages(c("car"))
library(car)
vif.lm <- Relevante_V03 %>%
mutate(sic = factor(sic)) %>%
group_by(sic, fdyear) %>%
filter(n() >= 10) %>%
group_split() %>%
map_dfr(.f = function(df){
lm(marketingspending ~ intr + sale_py_at_py + R_at_py + txt + value + dt, data=Relevante_V03) %>%
vif()
})
The result of cbind01 looks like this (excerpt):
ySic | r.squared | adj.r.squared | RMSE | Rsquared | MAE | fit |
---|---|---|---|---|---|---|
2000.2832 | 0.3059 | 0.0535 | 19162165 | 0.6041 | 182... | Overfit unlikely |
2001.3674 | 0.4932 | 0.2035 | 3479144 | 0.9801 | 270... | Overfit unlikely |
2001.3826 | 0.4932 | 0.2035 | NA | NA | NA | NA |
2003.3954 | 0.9356 | 0.8718 | 7460919 | 0.0001 | 480... | Check for overfit |
Furthermore, the result of the cbind is a table with 98 rows; some rows contain NA (how is that possible?), some say "Check for overfit", and some "Overfit unlikely". How can I get the exact count of each of the three outcomes in that column? And when do I have to conclude that a model is overfitting and when not? Is there a percentage threshold or something else?
Thanks to @Kat for helping me get this far :)
Upvotes: 2
Views: 1403
Reputation: 18714
If you're happy with the structure in which you created your lm()
output, just do the same for the vif()
output. You can't use the model when there is no model: lm01 only contains the glance()
metrics, not the fitted models, and metrics won't work for finding the VIF. Copy the code, but call vif()
instead of glance()
.
vif.lm <- Relevante_V03 %>%
  mutate(sic = factor(sic)) %>%
  group_by(sic) %>%
  filter(n() >= 10) %>%
  group_split() %>%
  map_dfr(.f = function(df){
    lm(marketingspending ~ intr + sale_py_at_py + R_at_py, data = df) %>%
      vif()
  })
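If you also want to keep track of which sic group each row of VIF values belongs to, a minimal sketch along the same lines (same data and variable names as above) can carry the group label the same way you did with glance():
# sketch: one row of VIFs per sic group, labelled with the group
vif.lm <- Relevante_V03 %>%
  mutate(sic = factor(sic)) %>%
  group_by(sic) %>%
  filter(n() >= 10) %>%
  group_split() %>%
  map_dfr(.f = function(df){
    lm(marketingspending ~ intr + sale_py_at_py + R_at_py, data = df) %>%
      vif() %>%            # named numeric vector of VIF values
      t() %>%              # 1-row matrix; names become column names
      as.data.frame() %>%  # so map_dfr() can row-bind the groups
      add_column(sic = unique(df$sic), .before = 1)
  })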
Based on your comment, I thought I would provide more information regarding the assumptions and testing the model.
For this, I grabbed some readily available data in R. I used mpg
from the ggplot2
package.
library(funModeling) # just for df_status
# ran before tidyverse, so dplyr has precedence
library(gvlma) # used for assumptions of MLR
library(caret) # used for partitioning and prediction metrics
library(tidyverse) # from your original code
library(broom) # from your original code
# inspecting the data
str(mpg)
df_status(mpg)
Next, I split the data into groups, as you did, but without running the model.
# splitting the data
mpg <- mpg %>%
  mutate(year = factor(year),
         cyl = factor(cyl)) %>%
  group_by(year, cyl) %>%
  filter(n() >= 10) %>%
  group_split()
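Because the later map() calls index the groups by position, it is worth a quick check of how many groups group_split() actually produced before hardcoding that number:
# number of year/cyl groups after filtering
length(mpg)        # 6 for this data
# seq_along(mpg) avoids hardcoding 1:6 in the map() calls below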
Now I will partition the data so that the models can first be trained then tested. I will also use the model to assess the assumptions of this type of regression.
# set the seed for repeatability
set.seed(59385)
# create training vector of indices for each group for training the model
tr <- mpg %>%
  map(~ createDataPartition(.x$displ,
                            list = F,
                            p = .7))
The models will be trained using the vectors of indices in tr
(one for each group).
# now map over mpg groups and tr vectors
lm01 <- map(1:6, ~lm(displ ~ cty + hwy, data = mpg[[.x]][tr[[.x]],]))
The tidy setup for the results can still be obtained, just like your original work.
resMod <- map_dfr(lm01, glance) %>%
  add_column(yCyl = map_chr(mpg, ~paste(unique(.x$year),
                                        unique(.x$cyl),
                                        sep = ".")),
             .before = 1)
There are a lot of ways to test the assumptions. One of the more thorough methods I have found is from the package gvlma
.
map(lm01, gvlma)
For this data, only the first model did not meet the assumptions:
# [[1]]
#
# Call:
# lm(formula = displ ~ cty + hwy, data = mpg[[.x]][tr[[.x]], ])
#
# Coefficients:
# (Intercept) cty hwy
# 3.35394 0.02865 -0.06536
#
#
# ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
# USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
# Level of Significance = 0.05
#
# Call:
# .f(x = .x[[i]])
#
# Value p-value Decision
# Global Stat 13.80206 0.0079543 Assumptions NOT satisfied!
# Skewness 0.07551 0.7834808 Assumptions acceptable.
# Kurtosis 1.46654 0.2258932 Assumptions acceptable.
# Link Function 11.77835 0.0005992 Assumptions NOT satisfied!
# Heteroscedasticity 0.48167 0.4876676 Assumptions acceptable.
You can test the models by first predicting with the test data, then getting the metrics for the predictions versus actuals with postResample
from the caret
package.
# now for testing the models
testLM <- map(1:6,
              ~predict(lm01[[.x]], mpg[[.x]][-tr[[.x]], ]))
# find the results
resPred <- map_dfr(1:6,
                   ~postResample(testLM[[.x]],
                                 mpg[[.x]][-tr[[.x]], ]$displ))
# # A tibble: 6 × 3
# RMSE Rsquared MAE
# <dbl> <dbl> <dbl>
# 1 0.282 0.211 0.256
# 2 0.430 0.235 0.374
# 3 0.836 0.175 0.635
# 4 0.210 0.497 0.186
# 5 0.293 0.266 0.265
# 6 0.474 0.163 0.434
To make it easier to recognize if there is a problem, I compared the training and testing like this:
cbind(resMod[, 1:3], resPred) %>%
  as.data.frame() %>%
  mutate(fit = ifelse(r.squared > Rsquared,
                      "Check for overfit",
                      "Overfit unlikely"))
# yCyl r.squared adj.r.squared RMSE Rsquared MAE fit
# 1 1999.4 0.4506805 0.41405924 0.2821392 0.2106108 0.2556183 Check for overfit
# 2 1999.6 0.5367920 0.50591147 0.4299883 0.2345990 0.3738415 Check for overfit
# 3 1999.8 0.1447143 0.04409243 0.8357696 0.1750658 0.6354507 Overfit unlikely
# 4 2008.4 0.2857471 0.22363818 0.2095897 0.4969283 0.1861229 Overfit unlikely
# 5 2008.6 0.1041972 0.02630127 0.2929450 0.2656493 0.2645665 Overfit unlikely
# 6 2008.8 0.1266099 0.06422494 0.4743184 0.1628117 0.4343889 Overfit unlikely
In this example, the first two models are questionable, at best. The first model did not meet the assumptions; that is probably why there is so much volatility between training and testing. The second model must be interesting...
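If you want to dig into a model that was flagged (the first one here), the standard base R diagnostic plots are a quick follow-up, for example:
# residuals-vs-fitted, Q-Q, scale-location, and leverage plots for model 1
par(mfrow = c(2, 2))
plot(lm01[[1]])
par(mfrow = c(1, 1))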
Upvotes: 1