Andrew Elliott

Reputation: 235

How do you remove an insignificant factor level from a regression using the lm() function in R?

When I perform a regression in R with a factor-type variable, it saves me from setting up the categorical dummy variables in the data myself. But how do I remove the insignificant factor levels from the regression so that only significant variables are shown?

For example:

dependent <- c(1:10)
independent1 <- as.factor(c('d','a','a','a','a','a','a','b','b','c'))
independent2 <- c(-0.71,0.30,1.32,0.30,2.78,0.85,-0.25,-1.08,-0.94,1.33)
output <- lm(dependent ~ independent1+independent2)
summary(output)

Which results in the following regression model:

Coefficients:
          Estimate Std. Error t value Pr(>|t|)   
(Intercept)     4.6180     1.0398   4.441  0.00676 **
independent1b   3.7471     2.1477   1.745  0.14148   
independent1c   5.5597     2.0736   2.681  0.04376 * 
independent1d  -3.7129     2.3984  -1.548  0.18230   
independent2   -0.1336     0.7880  -0.170  0.87203   

If I want to pull out the independent1 levels that are insignificant (b,d) is there a way that I can do that?

In this case setting up the data to have categorical variables is easy but when I'm including week numbers or another factor with a lot of levels it becomes inconvenient.
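(For what it's worth, one way to sidestep the hand-built dummy columns, useful for factors with many levels like week numbers, is to let `model.matrix()` expand the factor for you. This is a sketch using the example data from the question; the variable names `dummies` and `fit` are my own.)

```r
# Sketch: model.matrix() expands a factor into 0/1 indicator columns,
# so they don't have to be typed out by hand.
dependent    <- 1:10
independent1 <- factor(c('d','a','a','a','a','a','a','b','b','c'))
independent2 <- c(-0.71,0.30,1.32,0.30,2.78,0.85,-0.25,-1.08,-0.94,1.33)

# Drop the intercept column; what remains are the b, c, d dummies
# (level 'a' is the baseline).
dummies <- model.matrix(~ independent1)[, -1]

# Equivalent to the hand-built regression, with full control over
# which dummy columns to keep or drop.
fit <- lm(dependent ~ independent2 + dummies)
summary(fit)
```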

Here is the way to build the model using categorical variables. As you can see, it ends up being more of a pain to structure the data, but it also gives me more control.

regressionData <- data.frame(cbind(1:10,
                                   c(-0.71,0.30,1.32,0.30,2.78,0.85,-0.25,-1.08,-0.94,1.33),
                                   c(0,1,1,1,1,1,1,0,0,0),
                                   c(0,0,0,0,0,0,0,1,1,0),
                                   c(0,0,0,0,0,0,0,0,0,1),
                                   c(1,0,0,0,0,0,0,0,0,0)))

names(regressionData) <- c('dependent','independent2','independenta','independentb','independentc','independentd')

attach(regressionData)

result <- lm(dependent~independent2+independentb+independentc+independentd)
summary(result)

Now I can remove independent2 since it's insignificant

result <- lm(dependent~independentb+independentc+independentd)
summary(result)

I'll remove independentd since it's not significant

result <- lm(dependent~independentb+independentc)
summary(result)

But in this case the Adjusted R-squared drops (I'm not even going to run the partial F-test, since it would be significant). In many cases, though, this is not true, and I need to remove the categorical variable from the regression because it eats up degrees of freedom, which are important in this case, and potentially masks the value of other variables that are significant.
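(For reference, the partial F-test mentioned above can be run with `anova()` on the nested models. This is a self-contained sketch on the question's data; the names `full` and `reduced` are my own.)

```r
# Sketch: partial F-test comparing the full model against the reduced
# one, via anova() on nested lm fits.
dependent    <- 1:10
independent2 <- c(-0.71,0.30,1.32,0.30,2.78,0.85,-0.25,-1.08,-0.94,1.33)
independentb <- c(0,0,0,0,0,0,0,1,1,0)
independentc <- c(0,0,0,0,0,0,0,0,0,1)
independentd <- c(1,0,0,0,0,0,0,0,0,0)

full    <- lm(dependent ~ independent2 + independentb + independentc + independentd)
reduced <- lm(dependent ~ independentb + independentc)

# F statistic tests whether the dropped terms (independent2,
# independentd) jointly improve the fit.
anova(reduced, full)
```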

Upvotes: 4

Views: 27288

Answers (3)

Leo

Reputation: 31

You can remove levels of a factor variable using the exclude option:

lm(dependent ~ factor(independent1, exclude=c('b','d')) + independent2)

This way the levels b and d will not be included in the regression.

Cheers

Upvotes: 3

Ben Bolker

Reputation: 226182

If you're willing to take just the coefficient table and not the whole summary, you can just do this:

Extract the whole coefficient table:

ss <- coef(summary(output))

Take only the rows you want:

ss_sig <- ss[ss[,"Pr(>|t|)"]<0.05,]

printCoefmat pretty-prints coefficient tables with significance stars etc.

> printCoefmat(ss_sig)
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)     4.6180     1.0398  4.4414 0.006756 **
independent1c   5.5597     2.0736  2.6811 0.043760 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(This answer is similar to @Jilber's except that it automatically finds the non-significant rows for you rather than asking you to specify them manually.)

However, I have to agree with @Charlie's comment above that this is bad statistical practice ... it dichotomizes the predictors artificially into significant/non-significant (predictors with p=0.049 and p=0.051 will be treated differently), and it is especially bad with categorical predictors, where the particular set of parameters that are significant will depend on the contrasts/which level is used as the baseline ...

Upvotes: 1

Jilber Urbina

Reputation: 61154

If you only want to remove the non-significant levels from the output but include them in the estimation, you can use the coeftest function from the AER package; then, with proper indexing, you'll get what you want.

library(AER)
coeftest(output)[-c(2,4), ]
                Estimate Std. Error    t value    Pr(>|t|)
(Intercept)    4.6180039  1.0397726  4.4413595 0.006756325
independent1c  5.5596699  2.0736190  2.6811434 0.043760158
independent2  -0.1335893  0.7880382 -0.1695214 0.872031752

If you don't feel like using AER package you can also do the following:

summary(output)$coefficients[-c(2,4),]
                Estimate Std. Error    t value    Pr(>|t|)
(Intercept)    4.6180039  1.0397726  4.4413595 0.006756325
independent1c  5.5596699  2.0736190  2.6811434 0.043760158
independent2  -0.1335893  0.7880382 -0.1695214 0.872031752

I prefer the last one since it doesn't require installing an additional package.

I don't know if this is what you're looking for.

Upvotes: 1
