Christoph
Christoph

Reputation: 7063

How does R's lm function handle factor levels (in C_Cdqrls?)?

Or in other words: which algorithm is used in this case? I guess they use discriminant analysis as discribed e.g. in chapter 4.4. in James et. al. "An Introduction to Statistical Learning with Applications in R"?

After input from comments I could also restate the question as follows:

I found quite useful explanations in this article, but at some point it only states

... This [factor] deconstruction can be a complex task, so we will not go into details lest it take us too far afield...

I could not find anything in the documentation and I was not able to understand what's going on using debug(lm) What I understand using a reproducible example:

n <- 10
p <- 6
set.seed(1)
x <- seq(0, 20, length.out = n) + rnorm(n, 0, 1)
y <- c(1:3)
y <- sample(y, n, replace = TRUE)
z <- 10*y*x + 10*y + 10 + rnorm(n, 0, 1)
debug(lm)
fit <- lm(z ~ x*y)

After mt <- attr(mf, "terms") it looks like

mt
# ...
# attr(,"dataClasses")
#         z         x         y 
# "numeric" "numeric" "numeric" 

whereas after

n <- 10
p <- 6
set.seed(1)
x <- seq(0, 20, length.out = n) + rnorm(n, 0, 1)
y <- c(1:3)
y <- sample(y, n, replace = TRUE)
z <- 10*y*x + 10*y + 10 + rnorm(n, 0, 1)
y <- as.factor(y)
debug(lm)
fit <- lm(z ~ x*y)

and mt <- attr(mf, "terms") looks like

mt
# ...
# attr(,"dataClasses")
#         z         x         y 
# "numeric" "numeric"  "factor"

But then it seems, they always call lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) and there z <- .Call(C_Cdqrls, x, y, tol, FALSE) which I thought only works without factors. The link above nicely explains everything down to the model matrix and qr-decomposition which I thought does not work in case of factors.

Edit: The model matrix after x <- model.matrix(mt, mf, contrasts) already differs. In case of numerics

x
   (Intercept)          x y       x:y
1            1 -0.6264538 3 -1.879361
2            1  2.4058655 1  2.405866
3            1  3.6088158 2  7.217632
4            1  8.2619475 1  8.261947
5            1  9.2183967 1  9.218397
6            1 10.2906427 2 20.581285
7            1 13.8207624 1 13.820762
8            1 16.2938803 2 32.587761
9            1 18.3535591 3 55.060677
10           1 19.6946116 2 39.389223
attr(,"assign")
[1] 0 1 2 3

In case of factors

x
   (Intercept)          x y2 y3      x:y2       x:y3
1            1 -0.6264538  0  1  0.000000 -0.6264538
2            1  2.4058655  0  0  0.000000  0.0000000
3            1  3.6088158  1  0  3.608816  0.0000000
4            1  8.2619475  0  0  0.000000  0.0000000
5            1  9.2183967  0  0  0.000000  0.0000000
6            1 10.2906427  1  0 10.290643  0.0000000
7            1 13.8207624  0  0  0.000000  0.0000000
8            1 16.2938803  1  0 16.293880  0.0000000
9            1 18.3535591  0  1  0.000000 18.3535591
10           1 19.6946116  1  0 19.694612  0.0000000
attr(,"assign")
[1] 0 1 2 2 3 3
attr(,"contrasts")
attr(,"contrasts")$`y`
[1] "contr.treatment"

Edit 2: Part of the question can be also found here

Upvotes: 5

Views: 683

Answers (1)

Christoph
Christoph

Reputation: 7063

With the help of the answers to this question, I realized, that the answer is simple:

If the factors belong to the variables (predictor variable), the model.matrix just gets larger. Thus is it is clear, that C_Cdqrls can handle the model matrix.

Only if the dependent variable(s) contain factors, linear regression or lm doesn't work properly and discriminant analysis is one possibility. (At a first glance it seems, that stats::glm uses a logit model.

From Wikipedia:

Discriminant function analysis is very similar to logistic regression, and both can be used to answer the same research questions. Logistic regression does not have as many assumptions and restrictions as discriminant analysis. However, when discriminant analysis’ assumptions are met, it is more powerful than logistic regression. Unlike logistic regression, discriminant analysis can be used with small sample sizes. It has been shown that when sample sizes are equal, and homogeneity of variance/covariance holds, discriminant analysis is more accurate. With all this being considered, logistic regression has become the common choice, since the assumptions of discriminant analysis are rarely met.


Example:

x <- seq(0, 10, length.out = 21)
y <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
y <- as.factor(y)
df <- data.frame(x = x, y = y)

# see ??numeric and the ‘Warning’ section in factor:
plot(x, as.numeric(levels(y))[y], ylim = c(0, 1.2))

fit <- lm(y ~ x, data = df)
print(summary(fit))

fit_glm <- stats::glm(y ~ x, family = binomial(link = "logit"), data = df, control = list(maxit = 50))
print(summary(fit_glm))

df$glm.probs <- stats::predict(fit_glm, newdata = df, type = "response")
df$glm.pred = ifelse(glm.probs > 0.5, 1, 0)
points(x, df$glm.pred + 0.05, col = "red")

Upvotes: 0

Related Questions