Reputation: 203
I am trying to run a linear regression with two independent variables (lot and method) and a dependent variable (conc). When I run the regression, I get an NA value for one of the coefficients. When I change the order of the independent variables in the model, the NA value shows up for a different variable. Here is a reproducible data set:
library(tidyverse)
conc <- c(0.541666667, 0.571759259, 0.50462963,0.50462963,0.377314815,0.578703704,0.518518519,0.550925926,0.548611111,0.611111111,0.567550895,0.743368291,0.669339914,0.57063541,0.5490438,0.653917335,0.610734115,0.626156693,0.721776681,0.650832819,0.731481481,0.80787037,0.75,0.733796296,0.75,0.842592593,0.722222222,0.793981481,0.789636027,0.943861814,0.959284392,0.928439235,0.838988279,0.876002468,0.993214065,0.863664405,0.75,0.673611111,0.722222222,0.717592593,0.613425926,0.795805059,0.808143122,0.826650216,0.768044417,0.80197409)
lot <- c(rep(2, 20), rep(3, 16), rep(4, 10))
method <- c(rep(1, 20), rep(2, 26))
data <- data.frame(conc, lot, method) %>%
mutate(lot = as.factor(lot)) %>%
mutate(method = as.factor(method))
When I run the regression with the lot variable first, I get an NA value for "method2"
conc_lm1 <- lm(conc ~ lot + method + lot*method, data = data)
conc_lm1
Call:
lm(formula = conc ~ lot + method + lot * method, data = data)
Coefficients:
 (Intercept)         lot3         lot4      method2 lot3:method2 lot4:method2 
      0.5836       0.2493       0.1642           NA           NA           NA 
When I run the regression with the method variable first, I get an NA for "lot4"
conc_lm2 <- lm(conc ~ method + lot + lot*method, data = data)
conc_lm2
Call:
lm(formula = conc ~ method + lot + lot * method, data = data)
Coefficients:
 (Intercept)      method2         lot3         lot4 method2:lot3 method2:lot4 
     0.58356      0.16419      0.08507           NA           NA           NA 
I've done some research on why this might be happening, but I'm not sure I completely understand it. This post (https://stats.stackexchange.com/questions/25804/why-would-r-return-na-as-a-lm-coefficient) suggests the issue might be that my method and lot variables are linearly related. Any clarification would be much appreciated!
Upvotes: 3
Views: 1025
Reputation: 46888
Your variables method and lot are categorical, and you can check how they occur or co-occur in your data:
table(data$method, data$lot)

     2  3  4
  1 20  0  0
  2  0 16 10
If you look at the table above, you see that every observation with method 1 has lot 2, and lots 3 and 4 occur only with method 2. This means we cannot distinguish the effect of method from the effect of lot, as they always go hand in hand in your data: the coefficients for lot3, lot4, and method2 cannot all be estimated at once. We can try a simple model without the interactions:
coefficients(lm(conc ~ method + lot, data = data))

(Intercept)     method2        lot3        lot4 
 0.58356132  0.16418556  0.08506782          NA 
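If it helps, the aliasing can be inspected directly with alias(), which reports any coefficient that is an exact linear combination of the others (a quick sketch of that check on the model above):

```r
fit <- lm(conc ~ method + lot, data = data)
# The "Complete" block of the output lists dropped terms as exact
# linear combinations of retained columns; here lot4 = method2 - lot3.
alias(fit)
```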
If you put method first, method 1 is the reference; after the method2 coefficient is estimated, there are no observations left to separate lot4 from the terms already in the model, so its coefficient is dropped. If you take lot first, lot 2 is the reference, the model estimates lot3 and lot4, and then method2 can no longer be estimated.
coefficients(lm(conc ~ lot + method, data = data))

(Intercept)        lot3        lot4     method2 
  0.5835613   0.2492534   0.1641856          NA 
With the main effects aliased like this, the coefficients for the interaction terms certainly cannot be estimated either.
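The rank deficiency can also be confirmed numerically: the main-effects design matrix has four columns but only rank three, because the method2 dummy equals lot3 + lot4 (a minimal sketch of that check):

```r
X <- model.matrix(~ method + lot, data = data)
ncol(X)       # 4 columns: (Intercept), method2, lot3, lot4
qr(X)$rank    # 3 -- one column is redundant
# method2 is exactly the sum of the lot3 and lot4 indicators
all(X[, "method2"] == X[, "lot3"] + X[, "lot4"])   # TRUE
```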
Upvotes: 1
Reputation: 670
You don't have all the combinations of lot and method needed to estimate the coefficients in the model. For example, you have no observations with lot = 2 and method = 2. If you replace your definition of lot with this:
lot <- c(rep(2, 7), rep(3, 7), rep(4, 6), rep(2, 9), rep(3, 9), rep(4, 8))
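With this assignment every method–lot cell is occupied, which you can verify with a quick cross-tabulation (a sketch, assuming the data frame is rebuilt so it picks up the new lot vector):

```r
# rebuild the data frame with the new lot assignment
data <- data.frame(conc, lot, method) %>%
  mutate(lot = as.factor(lot), method = as.factor(method))

table(data$method, data$lot)
#     2 3 4
#   1 7 7 6
#   2 9 9 8
```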
You will get estimates of the coefficients for all terms in the model:
summary(lm(conc ~ lot * method, data = data)) #in R, terms in an interaction automatically have their direct effects estimated
Call:
lm(formula = conc ~ lot * method, data = data)
Residuals:
      Min        1Q    Median        3Q       Max 
-0.196063 -0.037004  0.003474  0.049869  0.134576 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.513889   0.027342  18.795  < 2e-16 ***
lot3         0.094903   0.038668   2.454   0.0186 *  
lot4         0.121521   0.040246   3.019   0.0044 ** 
method2      0.255176   0.036456   7.000 1.88e-08 ***
lot3:method2 0.005707   0.051557   0.111   0.9124    
lot4:method2 -0.133854  0.053436  -2.505   0.0164 *  
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.07234 on 40 degrees of freedom
Multiple R-squared: 0.7569, Adjusted R-squared: 0.7266
F-statistic: 24.91 on 5 and 40 DF, p-value: 2.571e-11
However, I would caution you to think about whether an interaction between these two factors really makes sense for your dataset, as I don't know the context.
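If you are unsure whether the interaction belongs in the model, one conventional check (a sketch, using the modified lot above so that all coefficients are estimable) is a nested-model F-test via anova():

```r
fit_main <- lm(conc ~ lot + method, data = data)   # main effects only
fit_int  <- lm(conc ~ lot * method, data = data)   # adds the interaction
anova(fit_main, fit_int)   # F-test on the two interaction terms
```

A significant F here means the interaction terms jointly improve the fit; otherwise the simpler additive model may be preferable.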
Upvotes: 2