Reputation: 131
My test datasets are
> y
DLogPrice
[1,] 3.4232680
[2,] -1.0099196
[3,] 0.7867983
[4,] -1.2224441
[5,] 3.5718083
[6,] -0.4550516
[7,] 1.6733032
[8,] 1.6540079
[9,] 0.6122239
[10,] -1.3530304
[11,] -18.9058749
[12,] 15.6916978
[13,] 1.9088818
and
> X
DQ DQInverseSize DQLogSize DQSize DQSize2
[1,] 2 1.925926e-05 23.10281197 208000 2.1664e+10
[2,] -2 -1.851852e-05 -23.17977301 -216000 -2.3328e+10
[3,] 1 9.259259e-06 11.58988651 108000 1.1664e+10
[4,] -1 -8.000000e-06 -11.73606902 -125000 -1.5625e+10
[5,] 2 1.600000e-05 23.47213803 250000 3.1250e+10
[6,] -1 -8.000000e-06 -11.73606902 -125000 -1.5625e+10
[7,] 1 9.259259e-06 11.58988651 108000 1.1664e+10
[8,] -2 -1.878307e-05 -23.15160214 -213000 -2.2689e+10
[9,] 0 0.000000e+00 0.00000000 0 0.0000e+00
[10,] 0 -4.761905e-07 0.04879016 5000 1.0250e+09
[11,] 0 9.090909e-07 -0.09531018 -10000 -2.1000e+09
[12,] 0 -9.090909e-07 0.09531018 10000 2.1000e+09
[13,] 2 1.869565e-05 23.16561287 215000 2.3225e+10
Solving for the linear regression coefficient vector b in y = Xb
solve(t(X) %*% X) %*% (t(X) %*% y)
results in the singularity error when trying to invert X'X:
> solve(t(X) %*% X) %*% (t(X) %*% y)
Error in solve.default(t(X) %*% X) :
system is computationally singular: reciprocal condition number = 8.6658e-43
I don't understand why then stats::lm() does report a result despite setting singularity.ok = FALSE
as in
> df <- data.frame(y,X)
>
> test.lm <- stats::lm(DLogPrice ~ DQ + DQInverseSize + DQLogSize + DQSize + DQSize2 -1, data=df, singular.ok = FALSE)
> summary(test.lm)
Call:
stats::lm(formula = DLogPrice ~ DQ + DQInverseSize + DQLogSize +
DQSize + DQSize2 - 1, data = df, singular.ok = FALSE)
Residuals:
Min 1Q Median 3Q Max
-4.2934 -1.6251 0.2413 1.0087 6.0501
Coefficients:
Estimate Std. Error t value Pr(>|t|)
DQ -4.492e+07 1.956e+07 -2.297 0.0507 .
DQInverseSize 1.503e+11 6.497e+10 2.314 0.0494 *
DQLogSize 4.038e+06 1.759e+06 2.295 0.0509 .
DQSize -3.606e+01 1.585e+01 -2.275 0.0524 .
DQSize2 5.353e-05 2.374e-05 2.255 0.0542 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.618 on 8 degrees of freedom
Multiple R-squared: 0.8371, Adjusted R-squared: 0.7353
F-statistic: 8.222 on 5 and 8 DF, p-value: 0.005156
What am I missing/misinterpreting here? Thanks for your thoughts.
Upvotes: 1
Views: 279
Reputation: 1306
Under the hood, R's lm.fit()
function uses a QR decomposition that enables it to handle nearly-singular cases like this more reliably:
y <-
tibble::tribble(
~DLogPrice,
3.4232680,
-1.0099196,
0.7867983,
-1.2224441,
3.5718083,
-0.4550516,
1.6733032,
1.6540079,
0.6122239,
-1.3530304,
18.9058749,
15.6916978,
1.9088818
)
x <- tibble::tribble(
~DQ, ~DQInverseSize, ~DQLogSize, ~DQSize, ~DQSize2,
2, 1.925926e-05, 23.10281197, 208000, 2.1664e+10,
-2, -1.851852e-05, -23.17977301, -216000, -2.3328e+10,
1, 9.259259e-06, 11.58988651, 108000, 1.1664e+10,
-1, -8.000000e-06, -11.73606902, -125000, -1.5625e+10,
2, 1.600000e-05, 23.47213803, 250000, 3.1250e+10,
-1, -8.000000e-06, -11.73606902, -125000, -1.5625e+10,
1, 9.259259e-06, 11.58988651, 108000, 1.1664e+10,
-2, -1.878307e-05, -23.15160214, -213000, -2.2689e+10,
0, 0.000000e+00, 0.00000000, 0, 0.0000e+00,
0, -4.761905e-07, 0.04879016, 5000, 1.0250e+09,
0, 9.090909e-07, -0.09531018, -10000, -2.1000e+09,
0, -9.090909e-07, 0.09531018, 10000, 2.1000e+09,
2, 1.869565e-05, 23.16561287, 215000, 2.3225e+10
)
qr.solve(cbind(1, x), y)
#> 1 DQ DQInverseSize DQLogSize DQSize
#> 3.547059e+00 -1.853857e+07 6.137535e+10 1.668251e+06 -1.508519e+01
#> DQSize2
#> 2.268808e-05
coef(lm(as.matrix(y) ~ as.matrix(x)))
#> (Intercept) as.matrix(x)DQ as.matrix(x)DQInverseSize
#> 3.547059e+00 -1.853857e+07 6.137535e+10
#> as.matrix(x)DQLogSize as.matrix(x)DQSize as.matrix(x)DQSize2
#> 1.668251e+06 -1.508519e+01 2.268808e-05
Created on 2021-06-01 by the reprex package (v2.0.0)
The textbook definition for OLS solve(t(x) %*% x) %*% t(x) %*% y
is computationally very inefficient and not a good way to implement OLS.
Upvotes: 2