kwh

Reputation: 203

Unsure why getting NA for coefficient in linear regression

I am trying to run a linear regression with two independent variables (lot and method) and a dependent variable (conc). When I run the regression I get an NA value for one of the coefficients. When I change the order of the independent variables in the model, the NA value shows up for a different variable. Here is a reproducible data set:

 library(tidyverse)

conc <- c(0.541666667, 0.571759259, 0.50462963,0.50462963,0.377314815,0.578703704,0.518518519,0.550925926,0.548611111,0.611111111,0.567550895,0.743368291,0.669339914,0.57063541,0.5490438,0.653917335,0.610734115,0.626156693,0.721776681,0.650832819,0.731481481,0.80787037,0.75,0.733796296,0.75,0.842592593,0.722222222,0.793981481,0.789636027,0.943861814,0.959284392,0.928439235,0.838988279,0.876002468,0.993214065,0.863664405,0.75,0.673611111,0.722222222,0.717592593,0.613425926,0.795805059,0.808143122,0.826650216,0.768044417,0.80197409)
lot <- c(rep(2, 20), rep(3, 16), rep(4, 10))
method <- c(rep(1, 20), rep(2, 26))

data <- data.frame(conc, lot, method) %>% 
  mutate(lot = as.factor(lot)) %>% 
  mutate(method = as.factor(method))

When I run the regression with the lot variable first, I get NA values for "method2" and the interaction terms:

conc_lm1 <- lm(conc ~ lot + method + lot*method, data = data)
conc_lm1

Call:
lm(formula = conc ~ lot + method + lot * method, data = data)

Coefficients:
 (Intercept)          lot3          lot4       method2  lot3:method2  lot4:method2  
      0.5836        0.2493        0.1642            NA            NA            NA  

When I run the regression with the method variable first, I get NA values for "lot4" and the interaction terms:

conc_lm2 <- lm(conc ~ method + lot + lot*method, data = data)
conc_lm2

Call:
lm(formula = conc ~ method + lot + lot * method, data = data)

Coefficients:
 (Intercept)       method2          lot3          lot4  method2:lot3  method2:lot4  
     0.58356       0.16419       0.08507            NA            NA            NA   

I've done some research on why this might be happening but I'm not sure I'm completely clear. This post (https://stats.stackexchange.com/questions/25804/why-would-r-return-na-as-a-lm-coefficient) suggests the issue might occur because my method and lot variables are linearly related? Any clarification would be much appreciated!

Upvotes: 3

Views: 1025

Answers (2)

StupidWolf

Reputation: 46888

Your variables method and lot are categorical, and you can check how they co-occur in your data:

table(data$method,data$lot)

     2  3  4
  1 20  0  0
  2  0 16 10

If you look at the table above, you see that every observation with method 1 has lot 2, and lots 3 and 4 occur only with method 2.

This means that we cannot distinguish the effect of method from the effect of lot, as they always go hand in hand in your data; the coefficients for lot3, lot4 and method2 cannot all be estimated at once. We can try a simple model without the interaction:

coefficients(lm(conc ~ method + lot, data = data))
(Intercept)     method2        lot3        lot4
 0.58356132  0.16418556  0.08506782          NA

If you put method first, method 1 is the reference; once the method2 coefficient has been estimated, only one independent lot contrast remains within method 2, so lot3 is estimated and lot4 is dropped (NA).

If you take lot first, lot 2 is set as the reference, the model estimates lot3 and lot4, and there is nothing left with which to estimate method2.

coefficients(lm(conc ~ lot + method, data = data))
(Intercept)        lot3        lot4     method2
  0.5835613   0.2492534   0.1641856          NA

With confounding like this, the interaction coefficients certainly cannot be estimated either.
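
If you want R to report the dependency directly, you can compare the number of requested coefficients with the rank of the design matrix, or call alias(); a minimal sketch, assuming the data frame built in the question:

X <- model.matrix(conc ~ lot * method, data = data)
ncol(X)      # 6 coefficients requested
qr(X)$rank   # only 3 independent columns, so 3 coefficients come back NA

alias(lm(conc ~ lot * method, data = data))

alias() shows that method2, lot3:method2 and lot4:method2 are linear combinations of the other columns, which is exactly the dependence producing the NA values.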

Upvotes: 1

ThetaFC

Reputation: 670

You don't have all the combinations of lot and method needed to estimate the coefficients in the model. For example, there are no observations with lot=2 and method=2. If you replace your definition of lot with this:

lot <- c(rep(2, 7), rep(3, 7), rep(4, 6), rep(2, 9), rep(3, 9), rep(4, 8))
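
To make this reproducible, rebuild the data frame from the new lot vector; a minimal sketch, reusing the conc and method vectors defined in the question:

data <- data.frame(conc, lot, method) %>% 
  mutate(lot = as.factor(lot), method = as.factor(method))
table(data$method, data$lot)  # every lot now occurs under both methods, so no cell is empty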

You will get estimates of the coefficients for all terms in the model:

summary(lm(conc ~ lot * method, data = data))  # in R, lot * method automatically includes the main effects of both terms along with their interaction

Call:
lm(formula = conc ~ lot * method, data = data)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.196063 -0.037004  0.003474  0.049869  0.134576 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.513889   0.027342  18.795  < 2e-16 ***
lot3          0.094903   0.038668   2.454   0.0186 *  
lot4          0.121521   0.040246   3.019   0.0044 ** 
method2       0.255176   0.036456   7.000 1.88e-08 ***
lot3:method2  0.005707   0.051557   0.111   0.9124    
lot4:method2 -0.133854   0.053436  -2.505   0.0164 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.07234 on 40 degrees of freedom
Multiple R-squared:  0.7569,    Adjusted R-squared:  0.7266 
F-statistic: 24.91 on 5 and 40 DF,  p-value: 2.571e-11

However, I would caution you to think about whether an interaction of two "dummy variables" really makes sense for your dataset, as I don't understand the context.

Upvotes: 2
