Reputation: 2085
I have performed best subset selection of a linear regression model using leaps::regsubsets. Then I chose the model with 14 predictors, and coef(model, 14) gave me the following output:
structure(c(16.1303774392893, -0.0787496652705482, -0.104929454314886,
-1.22322411065346, 1.14718778105312, 0.75468065020279, 0.455617836039703,
0.521951041899427, 0.0124590834643436, -0.0002293804247409,
1.26667965342874e-07, 1.4002805624594e-06, -9.90560347112683e-07,
1.8809273394337e-06, 5.48249071436573e-07), .Names = c("(Intercept)", "X1",
"X2", "poly(X4, 2)1", "poly(X5, 2)1", "poly(X6, 2)2", "poly(X7, 2)2",
"poly(X9, 2)1", "X10", "X12", "X13", "X14", "X16", "X17", "X18"))
To get this model, I need to fit it with lm. As poly(X, 2)1 is linear and poly(X, 2)2 is quadratic, I did:
lm(X20 ~ X1 + X2 + X4 + X5 + I(X6 ^ 2) + I(X7 ^ 2) +
X9 + X10 + X12 + X13 + X14 + X16 + X17 + X18, df)
I think I know why the coefficients are different (see poly() in lm(): difference between raw vs. orthogonal), but why don't they give the same fitted values and adjusted R2?
Of course, using poly(X, 2)[,2] in the formula gives complete consistency with the regsubsets output. But is it valid to use only the second term of an orthogonal polynomial and specify the model as follows?
lm(X20 ~ X1 + X2 + X4 + X5 + poly(X6, 2)[,2] + poly(X7, 2)[,2] +
X9 + X10 + X12 + X13 + X14 + X16 + X17 + X18, df)
Is there a more direct way to retrieve a single model from the regsubsets output than specifying the model by hand?
Upvotes: 1
Views: 1189
Reputation: 73315
but why don't they give the same fitted values and adjusted R2?
Fitted values won't necessarily be the same if you don't use all columns from poly.
set.seed(0)
y <- runif(100)
x <- runif(100)
X <- poly(x, 3)

## all columns, or a leading block of columns, reproduce the raw-polynomial fit
all.equal(lm(y ~ X)$fitted, lm(y ~ x + I(x ^ 2) + I(x ^ 3))$fitted)
#[1] TRUE
all.equal(lm(y ~ X[, 1:2])$fitted, lm(y ~ x + I(x ^ 2))$fitted)
#[1] TRUE

## dropping the intercept or skipping a lower-order column breaks the equivalence
all.equal(lm(y ~ X - 1)$fitted, lm(y ~ x + I(x ^ 2) + I(x ^ 3) - 1)$fitted) ## no intercept
#[1] "Mean relative difference: 33.023"
all.equal(lm(y ~ X[, c(1, 3)])$fitted, lm(y ~ x + I(x ^ 3))$fitted)
#[1] "Mean relative difference: 0.03008166"
all.equal(lm(y ~ X[, c(2, 3)])$fitted, lm(y ~ I(x ^ 2) + I(x ^ 3))$fitted)
#[1] "Mean relative difference: 0.03297488"
We only have ~ 1 + poly(x, degree)[, 1:k] equivalent to ~ 1 + x + I(x ^ 2) + ... + I(x ^ k), for any k <= degree. (I explicitly write out the intercept, to emphasize that we have to start from the polynomial of degree 0.)
(The reason is related to how an orthogonal polynomial is generated. See How `poly()` generates orthogonal polynomials? How to understand the "coefs" returned? for great details. Note that when doing a QR factorization X = QR, as R is an upper triangular matrix (not a diagonal matrix), Q[, ind] will not have the same column space as X[, ind] for an arbitrary subset ind, unless ind = 1:k.)
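Here is a small generic sketch of that column-space point, reusing x from the toy example above (this mimics the idea, not poly()'s exact internals):

Xraw <- cbind(1, x, x ^ 2, x ^ 3)   ## raw cubic basis, including the degree-0 column
Q <- qr.Q(qr(Xraw))                 ## orthonormal basis from the QR factorization

## leading columns: Q[, 1:2] spans the same space as Xraw[, 1:2],
## so regressing one block on the other leaves (numerically) zero residuals
max(abs(resid(lm(Xraw[, 1:2] ~ Q[, 1:2] - 1))))        ## ~ 0

## an arbitrary subset: Q[, c(1, 3)] does not span the same space as Xraw[, c(1, 3)]
max(abs(resid(lm(Xraw[, c(1, 3)] ~ Q[, c(1, 3)] - 1)))) ## clearly non-zero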
So, I(x ^ 2) is not equivalent to poly(x, 2)[, 2], and hence you will get different fitted values and (adjusted) R2.
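A quick check with the toy data above:

all.equal(lm(y ~ poly(x, 2)[, 2])$fitted, lm(y ~ I(x ^ 2))$fitted)
## reports a mean relative difference rather than TRUE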
is it valid to use only the second term of an orthogonal polynomial and specify the model as follows?
It is really a bad idea for leaps (or generally any modeler) to drop columns from an orthogonal polynomial. An orthogonal polynomial is a factor-like term, whose significance is determined by an F-statistic (i.e., treating all columns as a whole), rather than by t-statistics for individual columns.
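For example, one way to judge a polynomial term as a whole (a rough sketch with the toy data above) is an F-test comparing nested fits via anova(), rather than reading the t-statistic of a single poly column:

fit0 <- lm(y ~ 1)           ## null model
fit2 <- lm(y ~ poly(x, 2))  ## quadratic orthogonal polynomial, both columns
anova(fit0, fit2)           ## F-test for the polynomial term as a whole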
In fact, even for raw polynomials, it is not a good idea to omit any low-order term. For example, y ~ 1 + I(x ^ 2), which omits the linear term, is problematic: the fit is not invariant to a linear shift. For example, if we shift x to get x1:
shift <- runif(1) ## an arbitrary value; can be `mean(x)`
x1 <- x - shift
then y ~ 1 + I(x ^ 2) is not equivalent to y ~ 1 + I(x1 ^ 2), but still, y ~ 1 + x + I(x ^ 2) is equivalent to y ~ 1 + x1 + I(x1 ^ 2).
all.equal(lm(y ~ 1 + I(x ^ 2))$fitted, lm(y ~ 1 + I(x1 ^ 2))$fitted)
#[1] "Mean relative difference: 0.02020984"
all.equal(lm(y ~ 1 + x + I(x ^ 2))$fitted, lm(y ~ 1 + x1 + I(x1 ^ 2))$fitted)
#[1] TRUE
I briefly mentioned the issue of dropping columns at R: How to or should I drop an insignificant orthogonal polynomial basis in a linear model?, but my examples here give you more insight.
Is there a more direct way to retrieve a single model from the regsubsets output than specifying the model by hand?
I don't know; at least I did not figure it out almost 2 years ago when answering this thread: Get all models from leaps regsubsets.
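If you only need to rebuild the lm() call for one chosen size, a rough sketch of automating it from the coef() output might look like the following (assuming, as in your question, that the regsubsets object is model, the data frame is df, and the response is X20; single poly columns are mapped to [, j] indexing, with all the caveats discussed above):

cf <- coef(model, 14)
vars <- setdiff(names(cf), "(Intercept)")
## rewrite e.g. "poly(X4, 2)1" as "poly(X4, 2)[, 1]" so the selected columns are kept
vars <- gsub("\\)([0-9]+)$", ")[, \\1]", vars)
form <- reformulate(vars, response = "X20")
lm(form, data = df)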
One remaining question though. Assuming that leaps returns poly(X, 2)2, I should definitely retain poly(X, 2)1 in my model. But what if only poly(X, 2)1 is returned by leaps? Can the higher order term be dropped then?
There is no problem dropping higher order terms (in this case, where you originally fitted a quadratic polynomial). As I said, we have equivalence for ind = 1:j, where j <= degree. But make sure you understand this. Take the following two examples.
1. leaps drops poly(x, 5)3 and poly(x, 5)5. You can safely remove poly(x, 5)5, but are still advised to retain poly(x, 5)3. That is, instead of fitting a 5-th order polynomial, you fit a 4-th order one.
2. leaps drops poly(x, 6)3 and poly(x, 6)5. Since poly(x, 6)6 is not dropped, you are advised to drop no terms at all.
Upvotes: 4