Reputation: 129
I am getting different results from lmPerm based on the order in which I enter the variables in the function call. For example, placing NCF.pf before TotalProperties yields the following:
pfit <- lmp(NetCashOps ~ NCF.pf + TotalProperties, data = sub.pm, subset = Presence == 1)
summary(pfit)
...
Coefficients:
Estimate Iter Pr(Prob)
NCF.pf 4.581e-01 51 1
TotalProperties 5.246e+04 5000 <2e-16 ***
But when I switch the order of the terms in the formula and place TotalProperties before NCF.pf, the p-value on NCF.pf becomes significant:
pfit2 <- lmp(NetCashOps ~ TotalProperties + NCF.pf, data = sub.pm, subset = Presence == 1)
summary(pfit2)
...
Coefficients:
Estimate Iter Pr(Prob)
TotalProperties 5.246e+04 5000 <2e-16 ***
NCF.pf 4.581e-01 5000 <2e-16 ***
Am I missing something? Why would the p-values be different just because I switched the order of the variables in the function call?
Update - Data Source and lm Output (11/11/2016)
The data can be found on GitHub at this link.
When calling the standard lm function twice (reversing the order of the variables on the second call), the p-values are identical (see below). Hence, unlike with lmp from the lmPerm package, the order of the variables doesn't matter with lm.
fit1 <- lm(NetCashOps ~ NCF.pf + TotalProperties, data = sub.pm, subset = Presence == 1)
summary(fit1)
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.088e+05 2.258e+05 3.138 0.0019 **
NCF.pf 4.581e-01 1.112e-01 4.121 5.11e-05 ***
TotalProperties 5.246e+04 9.519e+03 5.511 8.76e-08 ***
fit2 <- lm(NetCashOps ~ TotalProperties + NCF.pf, data = sub.pm, subset = Presence == 1)
summary(fit2)
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.088e+05 2.258e+05 3.138 0.0019 **
TotalProperties 5.246e+04 9.519e+03 5.511 8.76e-08 ***
NCF.pf 4.581e-01 1.112e-01 4.121 5.11e-05 ***
Thanks!
Upvotes: 2
Views: 470
Reputation: 73385
I already saw 2 close votes to migrate this to Cross Validated, but in my humble opinion this should stay on Stack Overflow. It is true that the t-statistic and p-value are not invariant to the specification order of terms under the non-pivoted QR factorization strategy used by lm and lmp, but as shown in the new edit, for the OP's data these statistics should be invariant. So there must be something sensitive at the programming level.
My quick diagnosis suggests that if we set seqs = TRUE, rather than using the default FALSE, we get consistent results:
## I have subsetted data with `Presence == 1` into a new dataset `dat`
## I have also renamed variable name for simplicity
coef(summary(lmp(y ~ x1 + x2, dat, seqs = TRUE)))
# Estimate Iter Pr(Prob)
#(Intercept) 2.019959e+06 5000 0
#x1 4.580840e-01 5000 0
#x2 5.245619e+04 5000 0
coef(summary(lmp(y ~ x2 + x1, dat, seqs = TRUE)))
# Estimate Iter Pr(Prob)
#(Intercept) 2.019959e+06 5000 0
#x2 5.245619e+04 5000 0
#x1 4.580840e-01 5000 0
Note that Pr(Prob) is displayed as "< 2e-16" when printed by summary, but when using coef to extract the coefficient matrix, those tiny values are stored as 0.
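As a side note on where the "< 2e-16" text comes from: assuming the summary print method formats the p-value column with format.pval(), the way base R's printCoefmat() does (an assumption about lmPerm's internals, not verified), a stored value of exactly 0 is printed with a "<" threshold rather than as 0:
p <- 0                                      # raw value stored in the coefficient matrix
format.pval(p, eps = .Machine$double.eps)   # returns something like "< 2.22e-16"
p                                           # coef(summary(...)) hands back the raw 0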
The documentation of ?lmp mentions a little about this:
The SS will be calculated _sequentially_, just as ‘lm()’ does; or
they may be calculated _uniquely_, which means that the SS for
each source is calculated conditionally on all other sources.
I am at the moment not sure what SS is (as I am not a user of lmPerm), but it sounds like, for consistent results, we should set seqs = TRUE.
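For context on the quoted passage: in base R, "sequential" SS are what anova(lm(...)) reports (each term is tested after the terms listed before it, so term order matters when predictors are correlated), while the marginal t-tests shown by summary(lm(...)) condition on all other terms and are order-invariant. A small, self-contained sketch with made-up data:
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)                 # correlated predictors (simulated)
y  <- 1 + 0.5 * x1 + 0.5 * x2 + rnorm(n)
## Sequential (Type I) SS: each term tested after the ones before it,
## so swapping the order redistributes the SS (and F tests) between terms.
anova(lm(y ~ x1 + x2))
anova(lm(y ~ x2 + x1))
## Marginal t-tests: each coefficient tested given all other terms,
## so the formula order does not matter here.
coef(summary(lm(y ~ x1 + x2)))
coef(summary(lm(y ~ x2 + x1)))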
Upvotes: 3