Martin Dallinger

Reputation: 451

Is this a mistake in R? To my understanding the output should be the same

Recently, I was comparing two statistics exercises and noticed that R gives different outputs for what I believe is the same input. Is this unintended behavior of R?

model1 <- lm(rent ~ area + bath, data = rent99)
coefficients1 <- coef(model1)

# Using a matrix without an intercept column
X <- cbind(rent99$area, rent99$bath)
model2 <- lm(rent99$rent ~ X[, 1] + X[, 2])
coefficients2 <- coef(model2)

# Both coefficients1 and coefficients2 should be identical
coefficients1
coefficients2

Output:

(Intercept)        area       bath1 
 144.149195    4.587025  100.661413 
(Intercept)      X[, 1]      X[, 2] 
  43.487782    4.587025  100.661413

I would expect the coefficients to be identical, because the input data is identical.

Upvotes: 5

Views: 82

Answers (1)

Roland

Reputation: 132969

bath is a factor variable.

Let's reproduce:

set.seed(42)
x <- sample(0:1, 100, TRUE)
DF <- data.frame(x = factor(x),
                 y = 0.1 + 5 * x + rnorm(100))

coef(lm(y ~ x, data = DF))
#(Intercept)          x1 
# 0.03815139  5.06531032 

coef(lm(DF$y ~ cbind(DF$x)))
#(Intercept) cbind(DF$x) 
#  -5.027159    5.065310 

The issue is your use of cbind. It produces a matrix, and a matrix can hold only one data type. It cannot hold S3 objects (such as a factor), so the factor's class and levels are dropped and only its underlying integer codes are kept.
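A minimal sketch of that behavior, using a small made-up factor `f` (hypothetical, not the rent99 data): after cbind, the result is a plain integer matrix holding the codes, not the labels.

```r
# Hypothetical toy factor with levels "0" and "1" (not the rent99 data)
f <- factor(c(0, 1, 1, 0))

# cbind() builds a matrix, which drops the factor class and levels
m <- cbind(f)

f_is_factor <- is.factor(f)   # the original vector is a factor
m_is_factor <- is.factor(m)   # the matrix is not
m_type      <- typeof(m)      # storage type of the matrix

# The matrix holds the internal codes (1/2), not the labels (0/1)
codes_match <- identical(as.vector(m), as.integer(f))

print(m)
```

Here the matrix column contains 1 and 2 where the labels were 0 and 1, which is the same recoding that happens to bath in the question.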

Thus, cbind works like as.numeric in your example:

as.numeric(DF$x)
#  [1] 1 1 1 1 2 2 2 2 1 2 1 2 1 2 1 1 2 2 2 2 1 1 1 1 1 2 1 1 1 1 2 2 2 2 1 2 1 2 2 2 1 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 1 2 1 1 1 2 2 2 2 2 2 1 2 1 2 1 2 2 2
# [76] 2 1 1 1 1 2 1 2 1 1 2 2 1 1 1 1 2 1 2 2 2 1 2 2 2

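As you see, that returns the internal integers of the factor variable. Basically, you recoded that variable from 0/1 to 1/2. Shifting the predictor by 1 leaves the slope unchanged and lowers the intercept by one slope, which is why the second intercept is 144.149195 - 1 * 100.661413 = 43.487782.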
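If you do want to fit from a matrix, here is a sketch of two ways that should reproduce the formula-interface coefficients on the simulated data from above. The names `x_num`, `X`, and the `fit_*` variables are only illustrative. The first option assumes the factor labels are numeric strings such as "0" and "1"; the second lets model.matrix build the dummy coding for you.

```r
set.seed(42)
x <- sample(0:1, 100, TRUE)
DF <- data.frame(x = factor(x),
                 y = 0.1 + 5 * x + rnorm(100))

# Reference fit: the formula interface dummy-codes the factor
fit_formula <- coef(lm(y ~ x, data = DF))

# Option 1: recover the original 0/1 values from the labels,
# not from the internal codes (assumes labels are numeric strings)
x_num <- as.numeric(as.character(DF$x))
fit_numeric <- coef(lm(DF$y ~ x_num))

# Option 2: let model.matrix() build the design matrix,
# including the intercept column and the dummy column for x
X <- model.matrix(~ x, data = DF)
fit_matrix <- lm.fit(X, DF$y)$coefficients

# Both should agree with the reference fit up to numerical tolerance
all.equal(unname(fit_formula), unname(fit_numeric))
all.equal(unname(fit_formula), unname(fit_matrix))
```

With a factor that has more than two levels, only the model.matrix route gives the same model as the formula interface, since a single numeric column would impose a linear trend across the levels.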

Upvotes: 10
