Reputation: 4145
When I run the code below, everything works as expected
# install.packages("dynlm")
# install.packages("tidyr")
require(dynlm)
require(tidyr)
Time <- 1950:1993
Y <- c(5820, 5843, 5917, 6054, 6099, 6365, 6440, 6465, 6449, 6658, 6698, 6740, 6931,
7089, 7384, 7703, 8005, 8163, 8506, 8737, 8842, 9022, 9425, 9752, 9602, 9711,
10121, 10425, 10744, 10876, 10746, 10770, 10782, 11179, 11617, 12015, 12336,
12568, 12903, 13029, 13093, 12899, 13110, 13391)
X <- c(6284, 6390, 6476, 6640, 6628, 6879, 7080, 7114, 7113, 7256, 7264, 7382, 7583, 7718,
8140, 8508, 8822, 9114, 9399, 9606, 9875, 10111, 10414, 11013, 10832, 10906, 11192,
11406, 11851, 12039, 12005, 12156, 12146, 12349, 13029, 13258, 13552, 13545, 13890,
14005, 14101, 14003, 14279, 14341)
data <- data.frame(Time, Y, X)
data_ts <- ts(data, start = 1950, end = 1993, frequency = 1)
Modell <- dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)) + log(L(X, 3))
+ log(L(X, 4)) + log(L(X, 5)), data = data_ts)
summary(Modell)
My summary output in this case is this
...
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.059109 0.091926 -0.643 0.525
log(X) 0.883020 0.145754 6.058 9.17e-07 ***
log(L(X)) 0.004167 0.211420 0.020 0.984
log(L(X, 2)) -0.092880 0.207026 -0.449 0.657
log(L(X, 3)) -0.012016 0.210395 -0.057 0.955
log(L(X, 4)) 0.200596 0.212370 0.945 0.352
log(L(X, 5)) 0.014497 0.144103 0.101 0.920
...
Now, when I use gather() to define a new data frame for some plots
data_tidyr <- gather(data, "Key", "Value", -Time)
and re-run the above code not changing anything else I get this summary:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.05669 0.07546 -0.751 0.457
log(X) 0.82128 0.13486 6.090 3.53e-07 ***
log(L(X)) 0.17484 0.13365 1.308 0.198
log(L(X, 2)) NA NA NA NA
log(L(X, 3)) NA NA NA NA
log(L(X, 4)) NA NA NA NA
log(L(X, 5)) NA NA NA NA
I am puzzled by this behaviour as the gather operations (defining a new data frame with columns gathered into rows) has nothing do to with the data set I am using to run my regression (at least this was my impression). Somehow using gather() changes the way calculation is done, but I cannot see how. Help would be much appreciated!
Some number:
Ok thank you for all the answers and comments so far, but the question remains: WHAT is going on in the environment? I want to know why and how this happens. To me this is something serious, since to my understanding avoiding non-intended side-effects of one function call on others is precicly what functional languages like R are trying to achieve. Now, unless I am missing something here, this behaviour seems to be at odds with that intention.
Upvotes: 3
Views: 448
Reputation: 22313
The underlying reason for this unexpected change is that dplyr
(dplyr
, not tidyr
) changes the default method of the lag
function. The gather
function calls dplyr::select_vars
, which loads dplyr
via namespace and overwrites lag.default
.
The dynlm
function internally calls lag
when you use L
in the formula. The method dispatch then finds lag.default
. When dplyr
is loaded via namespace (it does not even need to be attached), the lag.default
from dplyr
is found.
The two lag functions are fundamentally different. In a new R session, you will find the following difference:
lag(1:3, 1)
## [1] 1 2 3
## attr(,"tsp")
## [1] 0 2 1
invisible(dplyr::mutate) # side effect: loads dplyr via namespace...
lag(1:3, 1)
## [1] NA 1 2
So the solution is fairly simple. Just overwrite the lag.default
function yourself.
lag.default <- stats:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## Time series regression with "ts" data:
## Start = 1952, End = 1993
##
## Call:
## dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
##
## Coefficients:
## (Intercept) log(X) log(L(X)) log(L(X, 2))
## -0.05476 0.83870 0.01818 0.13928
lag.default <- dplyr:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## Time series regression with "ts" data:
## Start = 1951, End = 1993
##
## Call:
## dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
##
## Coefficients:
## (Intercept) log(X) log(L(X)) log(L(X, 2))
## -0.05669 0.82128 0.17484 NA
lag.default <- stats:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## Time series regression with "ts" data:
## Start = 1952, End = 1993
##
## Call:
## dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
##
## Coefficients:
## (Intercept) log(X) log(L(X)) log(L(X, 2))
## -0.05476 0.83870 0.01818 0.13928
Upvotes: 7
Reputation: 6755
When I ran your first block of code in R 3.1.3 I got the results you are showing as your second set of results with this:
(R Version 3.1.3, dynlm version .3-3).
Time series regression with "ts" data:
Start = 1951, End = 1993
Call:
dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)) +
log(L(X, 3)) + log(L(X, 4)) + log(L(X, 5)), data = data_ts)
Residuals:
Min 1Q Median 3Q Max
-0.030753 -0.006364 0.001321 0.007939 0.025982
Coefficients: (4 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.05669 0.07546 -0.751 0.457
log(X) 0.82128 0.13486 6.090 3.53e-07 ***
log(L(X)) 0.17484 0.13365 1.308 0.198
log(L(X, 2)) NA NA NA NA
log(L(X, 3)) NA NA NA NA
log(L(X, 4)) NA NA NA NA
log(L(X, 5)) NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01419 on 40 degrees of freedom
Multiple R-squared: 0.9974, Adjusted R-squared: 0.9972
F-statistic: 7578 on 2 and 40 DF, p-value: < 2.2e-16
However when I updated to R 3.2.0 I got a repeat of what you got initially but then settled back into always getting the second results.
Now from your later comments you are also getting the second results consistently. So I think it must be that it is either that there was a typo in the code at some point or that there is something about the first time this runs in an empty environment.
Based on that hypothesis I closed RStudio completely, restarted and ran the first codeblock. In that case I got your initial result again.
So I think the answer to your question has to be there is something odd going on in the environment.
I read the documentation for dynlm and there are a few places where the defaults (if they are coming into play) would cause differences potentially. For example it will take the variables from the environment if no data is specified. It will use either a timeseries object or a dataframe. In your case you have both in the environment (data and data_ts). If you notice on the summary output that I have above it says Time series regression with "ts" data:
which means it is running with a ts
object. When I am getting the other result (without the NAs, your first result) it says Time series regression with "numeric" data:
and in that case it is running with on data
which is a dataframe or X and Y directly. So I think that must be the source of the difference. I'm not sure why exactly that would have happened with data_ts
explicitly named though.
Upvotes: -1