Using gather from tidyr changes my regression results

Question

When I run the code below, everything works as expected

# install.packages("dynlm")
# install.packages("tidyr")
require(dynlm)
require(tidyr)


Time <- 1950:1993

Y <- c(5820, 5843, 5917, 6054, 6099, 6365, 6440, 6465, 6449, 6658,  6698, 6740, 6931, 
       7089, 7384, 7703, 8005, 8163, 8506, 8737, 8842, 9022, 9425,  9752, 9602, 9711, 
       10121, 10425, 10744, 10876, 10746, 10770, 10782, 11179, 11617, 12015, 12336, 
       12568, 12903, 13029, 13093, 12899, 13110, 13391)

X <- c(6284, 6390, 6476, 6640, 6628, 6879, 7080, 7114, 7113, 7256, 7264, 7382, 7583, 7718,  
       8140, 8508, 8822, 9114, 9399, 9606, 9875, 10111, 10414, 11013, 10832, 10906, 11192, 
       11406, 11851, 12039, 12005, 12156, 12146, 12349, 13029, 13258, 13552, 13545, 13890, 
       14005, 14101, 14003, 14279, 14341)

data <- data.frame(Time, Y, X)

data_ts <- ts(data, start = 1950, end = 1993, frequency = 1)

Modell <- dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)) + log(L(X, 3)) 
                     + log(L(X, 4)) + log(L(X, 5)), data = data_ts)
summary(Modell)

My summary output in this case is this

...        
              Estimate  Std. Error t value Pr(>|t|)    
(Intercept)  -0.059109   0.091926  -0.643    0.525    
log(X)        0.883020   0.145754   6.058 9.17e-07 ***
log(L(X))     0.004167   0.211420   0.020    0.984    
log(L(X, 2)) -0.092880   0.207026  -0.449    0.657    
log(L(X, 3)) -0.012016   0.210395  -0.057    0.955    
log(L(X, 4))  0.200596   0.212370   0.945    0.352    
log(L(X, 5))  0.014497   0.144103   0.101    0.920 
...

Now, when I use gather() to define a new data frame for some plots

data_tidyr <- gather(data, "Key", "Value", -Time)

and re-run the above code not changing anything else I get this summary:

              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.05669    0.07546  -0.751    0.457    
log(X)        0.82128    0.13486   6.090 3.53e-07 ***
log(L(X))     0.17484    0.13365   1.308    0.198    
log(L(X, 2))       NA         NA      NA       NA    
log(L(X, 3))       NA         NA      NA       NA    
log(L(X, 4))       NA         NA      NA       NA    
log(L(X, 5))       NA         NA      NA       NA

I am puzzled by this behaviour as the gather operations (defining a new data frame with columns gathered into rows) has nothing do to with the data set I am using to run my regression (at least this was my impression). Somehow using gather() changes the way calculation is done, but I cannot see how. Help would be much appreciated!

Some number:

"dynlm" version 0.3-3
R version: 3.2.0 (64 bit)

Update

Ok thank you for all the answers and comments so far, but the question remains: WHAT is going on in the environment? I want to know why and how this happens. To me this is something serious, since to my understanding avoiding non-intended side-effects of one function call on others is precicly what functional languages like R are trying to achieve. Now, unless I am missing something here, this behaviour seems to be at odds with that intention.

shadow · Accepted Answer

The underlying reason for this unexpected change is that dplyr (dplyr, not tidyr) changes the default method of the lag function. The gather function calls dplyr::select_vars, which loads dplyr via namespace and overwrites lag.default.

The dynlm function internally calls lag when you use L in the formula. The method dispatch then finds lag.default. When dplyr is loaded via namespace (it does not even need to be attached), the lag.default from dplyr is found.

The two lag functions are fundamentally different. In a new R session, you will find the following difference:

lag(1:3, 1)
## [1] 1 2 3
## attr(,"tsp")
## [1] 0 2 1
invisible(dplyr::mutate) # side effect: loads dplyr via namespace...
lag(1:3, 1)
## [1] NA  1  2

So the solution is fairly simple. Just overwrite the lag.default function yourself.

lag.default <- stats:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)

## Time series regression with "ts" data:
##   Start = 1952, End = 1993
## 
## Call:
##   dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## 
## Coefficients:
##   (Intercept)        log(X)     log(L(X))  log(L(X, 2))  
## -0.05476       0.83870       0.01818       0.13928      

lag.default <- dplyr:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)

## Time series regression with "ts" data:
## Start = 1951, End = 1993
## 
## Call:
## dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## 
## Coefficients:
##  (Intercept)        log(X)     log(L(X))  log(L(X, 2))  
##     -0.05669       0.82128       0.17484            NA  

lag.default <- stats:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)

## Time series regression with "ts" data:
##   Start = 1952, End = 1993
## 
## Call:
##   dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## 
## Coefficients:
##   (Intercept)        log(X)     log(L(X))  log(L(X, 2))  
## -0.05476       0.83870       0.01818       0.13928

Using gather from tidyr changes my regression results

Update

Answers (2)

Related Questions