Reputation: 235
When I run a regression in R, using a factor saves me from setting up the categorical (dummy) variables in the data by hand. But how do I remove a factor level that is not significant from the regression, so that only the significant variables are shown?
For example:
dependent <- c(1:10)
independent1 <- as.factor(c('d','a','a','a','a','a','a','b','b','c'))
independent2 <- c(-0.71,0.30,1.32,0.30,2.78,0.85,-0.25,-1.08,-0.94,1.33)
output <- lm(dependent ~ independent1+independent2)
summary(output)
Which results in the following regression model:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.6180 1.0398 4.441 0.00676 **
independent1b 3.7471 2.1477 1.745 0.14148
independent1c 5.5597 2.0736 2.681 0.04376 *
independent1d -3.7129 2.3984 -1.548 0.18230
independent2 -0.1336 0.7880 -0.170 0.87203
If I want to pull out the independent1 levels that are insignificant (b,d) is there a way that I can do that?
In this case setting up the data to have categorical variables is easy but when I'm including week numbers or another factor with a lot of levels it becomes inconvenient.
Here is the way to build the model using categorical variables. As you can see, it ends up being more of a pain to structure the data, but it also gives me more control.
regressionData <- data.frame(cbind(1:10,c(-0.71,0.30,1.32,0.30,2.78,0.85,-0.25,-1.08,-0.94,1.33),c(0,1,1,1,1,1,1,0,0,0),c(0,0,0,0,0,0,0,1,1,0),c(0,0,0,0,0,0,0,0,0,1),c(1,0,0,0,0,0,0,0,0,0)))
names(regressionData) <- c('dependent','independent2','independenta', 'independentb','independentc','independentd')
attach(regressionData)
result <- lm(dependent~independent2+independentb+independentc+independentd)
summary(result)
Now I can remove independent2 since it's insignificant
result <- lm(dependent~independentb+independentc+independentd)
summary(result)
I'll remove independentd since it's not significant
result <- lm(dependent~independentb+independentc)
summary(result)
But in this case the adjusted R-squared drops (I'm not even going to do the partial F-test, since it would be significant). In many cases, though, this is not true, and I need to remove the categorical variable from the regression because it's eating up degrees of freedom, which are important in this case, and potentially masking the value of other variables that are significant.
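For reference, a minimal sketch of the partial F-test mentioned above, using the factor version of the data from the first example. The names full, reduced and ftest are illustrative only; anova() and drop1() are standard functions in the stats package.

```r
# Data from the first example in the question
dependent    <- 1:10
independent1 <- factor(c('d','a','a','a','a','a','a','b','b','c'))
independent2 <- c(-0.71,0.30,1.32,0.30,2.78,0.85,-0.25,-1.08,-0.94,1.33)

full    <- lm(dependent ~ independent1 + independent2)
reduced <- lm(dependent ~ independent2)   # same model without the factor

# Partial F-test: does independent1 as a whole improve the fit?
ftest <- anova(reduced, full)
print(ftest)

# drop1() runs the same kind of test for each term of the full model in turn
print(drop1(full, test = "F"))
```

This tests the factor as a single term (3 degrees of freedom here) rather than level by level.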
Upvotes: 4
Views: 27288
Reputation: 31
You can remove levels of a factor variable using the exclude option:
lm(dependent ~ factor(independent1, exclude=c('b','d')) + independent2)
This way the levels b and d will not be included in the regression.
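A quick sketch of what exclude does to the data (f_excl and fit_excl are illustrative names): the excluded levels become NA, and lm, with the default na.action, then drops those rows. So the model is fit on fewer observations; b and d are not folded into the baseline.

```r
# Data from the question
dependent    <- 1:10
independent1 <- factor(c('d','a','a','a','a','a','a','b','b','c'))
independent2 <- c(-0.71,0.30,1.32,0.30,2.78,0.85,-0.25,-1.08,-0.94,1.33)

# Excluded levels are turned into NA
f_excl <- factor(independent1, exclude = c('b','d'))
print(f_excl)
print(sum(is.na(f_excl)))   # number of observations that became NA

# Rows with NA are dropped from the fit by the default na.action
fit_excl <- lm(dependent ~ f_excl + independent2)
print(nobs(fit_excl))       # observations actually used
print(coef(fit_excl))
```

Comparing nobs(fit_excl) with the original 10 rows shows how much data the fit loses.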
Cheers
Upvotes: 3
Reputation: 226182
If you're willing to take just the coefficient table and not the whole summary, you can just do this:
Extract the whole coefficient table:
ss <- coef(summary(output))
Take only the rows you want:
ss_sig <- ss[ss[,"Pr(>|t|)"]<0.05,]
printCoefmat pretty-prints coefficient tables with significance stars etc.
> printCoefmat(ss_sig)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.6180 1.0398 4.4414 0.006756 **
independent1c 5.5597 2.0736 2.6811 0.043760 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
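The steps above can be wrapped in a small helper. This is only a sketch: sig_coefs is a hypothetical name, not a base R function, and alpha is an assumed argument. The drop = FALSE is there because, if only one row passes the cutoff, plain indexing would collapse the table to a vector.

```r
# Hypothetical helper: keep only coefficient rows with p-value below alpha
sig_coefs <- function(model, alpha = 0.05) {
  tab <- coef(summary(model))
  # drop = FALSE keeps a matrix even when a single row is selected
  tab[tab[, "Pr(>|t|)"] < alpha, , drop = FALSE]
}

# Data and model from the question
dependent    <- 1:10
independent1 <- factor(c('d','a','a','a','a','a','a','b','b','c'))
independent2 <- c(-0.71,0.30,1.32,0.30,2.78,0.85,-0.25,-1.08,-0.94,1.33)
output <- lm(dependent ~ independent1 + independent2)

ss_sig <- sig_coefs(output)
printCoefmat(ss_sig)

# A stricter cutoff may leave a single row; the result is still a matrix
one_row <- sig_coefs(output, alpha = 0.01)
print(one_row)
```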
(This answer is similar to @Jilber's except that it automatically finds the non-significant rows for you rather than asking you to specify them manually.)
However, I have to agree with @Charlie's comment above that this is bad statistical practice: it dichotomizes the predictors artificially into significant/non-significant (predictors with p=0.049 and p=0.051 will be treated differently), and it is especially bad with categorical predictors, where the particular set of parameters that are significant will depend on the contrasts, i.e. on which level is used as the baseline.
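To illustrate the baseline point, here is a sketch that refits the same model with level c as the reference instead of a (independent1_c and refit are illustrative names). The fitted values are identical, but each level's coefficient now measures the difference from c, so which rows pass the 0.05 cutoff can change.

```r
# Data and model from the question (baseline level is 'a')
dependent    <- 1:10
independent1 <- factor(c('d','a','a','a','a','a','a','b','b','c'))
independent2 <- c(-0.71,0.30,1.32,0.30,2.78,0.85,-0.25,-1.08,-0.94,1.33)
output <- lm(dependent ~ independent1 + independent2)

# Same factor, but with 'c' as the reference level
independent1_c <- relevel(independent1, ref = "c")
refit <- lm(dependent ~ independent1_c + independent2)

# Compare the two coefficient tables
printCoefmat(coef(summary(output)))
printCoefmat(coef(summary(refit)))

# The two parameterizations describe the same fitted model
print(all.equal(unname(fitted(output)), unname(fitted(refit))))
```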
Upvotes: 1
Reputation: 61154
If you only want to remove the non-significant levels from the output but include them in the estimation, you can use the coeftest function (from lmtest, which the AER package loads), and then with proper indexing you'll get what you want.
library(AER)
coeftest(output)[-c(2,4), ]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.6180039 1.0397726 4.4413595 0.006756325
independent1c 5.5596699 2.0736190 2.6811434 0.043760158
independent2 -0.1335893 0.7880382 -0.1695214 0.872031752
If you don't feel like using the AER package you can also do the following:
summary(output)$coefficients[-c(2,4),]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.6180039 1.0397726 4.4413595 0.006756325
independent1c 5.5596699 2.0736190 2.6811434 0.043760158
independent2 -0.1335893 0.7880382 -0.1695214 0.872031752
I prefer the last one since you don't need to install an additional package.
I don't know if this is what you're looking for.
Upvotes: 1