Reputation: 539
I am trying to identify correlated explanatory variables and eliminate. I'm using Sapply to apply the regression to the variables I am interested in, and manually deleting the ones with FIVs > 10. However, when I try to reproduce this to quickly run for many vifs, I cannot manage to get my regression script to run with a pasted formula object containing the the names I want to keep. Below:
regressiondata <- data.frame(matrix(ncol=9,nrow=100,runif(900,1,100)))
colnames(regressiondata) <- c("indep1","indep2","indep3","indep4","var1","var2","var3","var4","var5")
vifs1_model <- sapply(regressiondata[,indep_variables],function(x) vif(lm(x~var1+var2+var3+var4+var5,
data = regressiondata,
na.action=na.exclude)))
vifs1 <- rowMeans(vifs1_model)
formula_variables <- paste(names(vifs1),collapse="+")
final_model <- t(round(sapply(regressiondata[,indep_variables],
function(x) lm(x ~ formula_variables,data=regressiondata,na.action=na.exclude)$coef),2))
when I run "final_model" I get this error:
Error in t(round(sapply(regressiondata[, indep_variables], function(x) lm(x ~ : error in evaluating the argument 'x' in selecting a method for function 't': Error in model.frame.default(formula = x ~ formula_variables, data = regressiondata, : variable lengths differ (found for 'formula_variables')
Upvotes: 0
Views: 383
Reputation: 4024
Here is a dplyr way of doing things. The bulk of the work is done by the sub_regression function, which conducts a regression, filters independent variables by vif, and then redoes the regression
library(dplyr)
library(tidyr)
library(magrittr)
library(car)
sub_regression = function(sub_data_frame)
lm(independent_value ~ var1+var2+var3+var4+var5,
data = sub_data_frame ,
na.action="na.exclude") %>%
vif %>%
Filter(function(x) x <= 10, .) %>%
names %>%
paste(collapse = " + ") %>%
paste("independent_value ~ ", .) %>%
as.formula %>%
lm(. , sub_data_frame, na.action="na.exclude") %>%
coefficients %>%
round(3) %>%
as.list %>%
data.frame(check.names = FALSE)
matrix(ncol=9,nrow=100,runif(900,1,100)) %>%
data.frame %>%
setNames(c("indep1","indep2","indep3","indep4","var1","var2","var3","var4","var5")) %>%
gather(independent_variable, independent_value,
indep1, indep2, indep3, indep4) %>%
group_by(independent_variable) %>%
do(sub_regression(.))
Upvotes: 0
Reputation: 11
I think you have a couple of issues:
Here is my walk through. Your code refers to some missing objects so I have added in some lines I think you left out
library(car) # for fiv()
regressiondata <- data.frame(matrix(ncol=9,nrow=100,runif(900,1,100)))
colnames(regressiondata) <- c("indep1",
"indep2",
"indep3",
"indep4",
"var1",
"var2",
"var3",
"var4",
"var5")
indep_variables <- names(regressiondata)[1:4] # object did not exist
I broke out the anonymous functions for clarity:
f1 <- function(x) {
vif(lm(x~var1+var2+var3+var4+var5,
data = regressiondata,
na.action=na.exclude))
}
Now your regressions
vifs1_model <- sapply(regressiondata[,indep_variables], f1)
vifs1 <- rowMeans(vifs1_model)
formula_variables <- paste(names(vifs1),collapse="+")
I named this function that pulls the coefficients and handed lm a character vector (string) with the whole formula:
getCoefs <- function(x) {
lm(paste(x, "~", formula_variables), data=regressiondata,
na.action=na.exclude)$coef
}
Now, just sapply over the vector of names, then transpose and round:
final_model <- sapply(indep_variables, getCoefs)
final_model <- t(round(final_model ,2))
Upvotes: 1