Reputation: 827
I'm trying to estimate factors that determine the difference in happiness level between people living in New York and Chicago.
Data looks like below.
Happiness City Gender Employment Worktype Holiday
1 60 New York 0 0 Unemployed Unemployed
2 80 Chicago 1 1 Whitecolor 1 day a week
3 39 Chicago 0 0 Unemployed Unemployed
4 40 New York 1 0 Unemployed Unemployed
5 69 Chicago 1 1 Bluecolor 2 day a week
6 90 Chicago 1 1 Bluecolor 2 day a week
7 100 New York 0 1 Whitecolor 2 day a week
8 30 New York 1 1 Whitecolor 1 day a week
Happiness level is dependent variable, and 'city' is where the person lives. 'Gender' is coded 0 = man 1 = woman. 'Employment' is 0 = Unemployed and 1 = Employed. 'Worktype' is three level factor: 'Unemployed', 'Whitecolor', 'Bluecolor'. 'Holiday' is how many days a person rest in a week. Here 'City', 'Gender', 'Worktype' and 'Holiday' variables are all factors. 'Happiness' and 'Employment' variable types are numerical.
The Model I want to estimate is
lm(Happiness ~ City + Gender + Employment:(Worktype + Holiday))
I left 'Employment' value as numerical value so if 'Employment' is equal to 0(Unemployed), 0:(Worktype + Holiday) = 0, so the model is automatically reduced to
lm(Happiness ~ City + Gender)
for unemployed people.
However, regression result returns NA values.
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.75 23.56 2.408 0.138
CityNew York -14.50 27.21 -0.533 0.647
Gender1 -2.25 35.99 -0.063 0.956
Employment:WorktypeBluecolor 25.00 43.02 0.581 0.620
Employment:WorktypeUnemployed NA NA NA NA
Employment:WorktypeWhitecolor 57.75 35.99 1.604 0.250
Employment:Holiday1 day a week -50.00 54.42 -0.919 0.455
Employment:Holiday2 day a week NA NA NA NA
this seems to be due to 'Unemployment' value in 'Worktype' and 'Holiday' variable. However, I am not sure why R is not treating Employment:WorktypeUnemployed which is obviously 0:Worktype = 0 as zero and not removing it from the model. Is this because R is setting Employment:HolidayUnemployed as a baseline and both are perfectly multicollinear? (I had to put 'Unemployed' value for 'Worktype' and 'Holiday' because I wanted to see the effect of 'Worktype' and 'Holiday' compared to 'Unemployed' people. If I remove 'Unemployed' value NA disappears, but baseline will be 'Whitecolor' and '1day a week' so I cannot see the effect compared to 'unemployed'.)
If so, Why am I getting NA for coefficients for 'Employement:Holiday2 day a week'? It seems that it has nothing to do with 'Unemployed' value.
Can I rely on this result while just removing NA coefficients?
below are reproducible code.
Happiness <- c(60, 80, 39, 40, 69, 90, 100, 30)
City <- as.factor(c("New York", "Chicago", "Chicago", "New York", "Chicago",
"Chicago", "New York", "New York"))
Gender <- as.factor(c(0, 1, 0, 1, 1, 1, 0, 1)) # 0 = man, 1 = woman.
Employment <- c(0,1, 0, 0, 1 ,1 , 1 , 1) # 0 = unemployed, 1 = employed.
Worktype <- as.factor(c("Unemployed", "Whitecolor", "Unemployed",
"Unemployed", "Bluecolor", "Bluecolor", "Whitecolor","Whitecolor"))
Holiday <- as.factor(c(0, 1, 0, 0, 2, 2, 2, 1))
levels(Holiday) <- c("Unemployed", "1 day a week", "2 day a week")
data <- data.frame(Happiness, City, Gender, Employment, Worktype, Holiday)
head(data,8)
str(data)
reg <- lm(Happiness ~ City + Gender + Employment:(Worktype + Holiday))
summary(reg)
Upvotes: 1
Views: 6577
Reputation: 311
You shouldn't worry about the NA values for Employment:WorktypeUnemployed
. R tries automatically to compute all the interactions, but that particular coefficient remains undetermined because, clearly, it is never the case that Employment=1 and Worktype="Unemployed". It does not have any effect on the computations of the other coefficients: you can verify by manually coding the dummy variables:
> library(lme4) # for the convenient "dummy" function
> data <- data.frame(data,
+ dummy(Worktype, c("Bluecolor","Whitecolor")),
+ h1=dummy(Holiday)[,1],
+ h2=dummy(Holiday)[,2])
>
> reg <- lm(Happiness ~ City + Gender + Employment:Bluecolor + Employment:Whitecolor + Employment:h1 + Employment:h2 , data)
> summary(reg)
Call:
lm(formula = Happiness ~ City + Gender + Employment:Bluecolor +
Employment:Whitecolor + Employment:h1 + Employment:h2, data = data)
Residuals:
1 2 3 4 5 6 7 8
1.775e+01 1.775e+01 -1.775e+01 8.882e-16 -1.050e+01 1.050e+01 4.441e-15 -1.775e+01
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.75 23.56 2.408 0.138
CityNew York -14.50 27.21 -0.533 0.647
Gender1 -2.25 35.99 -0.063 0.956
Employment:Bluecolor 25.00 43.02 0.581 0.620
Employment:Whitecolor 57.75 35.99 1.604 0.250
Employment:h1 -50.00 54.42 -0.919 0.455
Employment:h2 NA NA NA NA
Residual standard error: 27.21 on 2 degrees of freedom
Multiple R-squared: 0.6798, Adjusted R-squared: -0.1208
F-statistic: 0.8491 on 5 and 2 DF, p-value: 0.619
The estimated coefficients are identical even though Employment:WorktypeUnemployed
is not present anymore.
However, the NA values are still present for Employment:h2
(equivalent to Employment:Holiday2 day a week
). This seems due to the fact that in this reduced dataset you end up with a singular model matrix (i.e. one column is a linear combination of other columns)
> solve(crossprod(model.matrix(reg)))
Error in solve.default(crossprod(model.matrix(reg))) :
system is computationally singular: reciprocal condition number = 1.79897e-18
So this issue may not be present with a larger dataset. Eventually, you could try to drop any redundancy in the model (e.g., are there any employed with 0 days per week of holiday? if not then 1 day should be the baseline, and you would add extra columns to code for days of holiday >1). You can use the alias()
function to check which term is giving the issue.
Upvotes: 4