Manuel R
Manuel R

Reputation: 4145

Using gather from tidyr changes my regression results

When I run the code below, everything works as expected

# install.packages("dynlm")
# install.packages("tidyr")
require(dynlm)
require(tidyr)


Time <- 1950:1993

Y <- c(5820, 5843, 5917, 6054, 6099, 6365, 6440, 6465, 6449, 6658,  6698, 6740, 6931, 
       7089, 7384, 7703, 8005, 8163, 8506, 8737, 8842, 9022, 9425,  9752, 9602, 9711, 
       10121, 10425, 10744, 10876, 10746, 10770, 10782, 11179, 11617, 12015, 12336, 
       12568, 12903, 13029, 13093, 12899, 13110, 13391)

X <- c(6284, 6390, 6476, 6640, 6628, 6879, 7080, 7114, 7113, 7256, 7264, 7382, 7583, 7718,  
       8140, 8508, 8822, 9114, 9399, 9606, 9875, 10111, 10414, 11013, 10832, 10906, 11192, 
       11406, 11851, 12039, 12005, 12156, 12146, 12349, 13029, 13258, 13552, 13545, 13890, 
       14005, 14101, 14003, 14279, 14341)

data <- data.frame(Time, Y, X)

data_ts <- ts(data, start = 1950, end = 1993, frequency = 1)

Modell <- dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)) + log(L(X, 3)) 
                     + log(L(X, 4)) + log(L(X, 5)), data = data_ts)
summary(Modell)

My summary output in this case is this

...        
              Estimate  Std. Error t value Pr(>|t|)    
(Intercept)  -0.059109   0.091926  -0.643    0.525    
log(X)        0.883020   0.145754   6.058 9.17e-07 ***
log(L(X))     0.004167   0.211420   0.020    0.984    
log(L(X, 2)) -0.092880   0.207026  -0.449    0.657    
log(L(X, 3)) -0.012016   0.210395  -0.057    0.955    
log(L(X, 4))  0.200596   0.212370   0.945    0.352    
log(L(X, 5))  0.014497   0.144103   0.101    0.920 
...

Now, when I use gather() to define a new data frame for some plots

data_tidyr <- gather(data, "Key", "Value", -Time)

and re-run the above code not changing anything else I get this summary:

              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.05669    0.07546  -0.751    0.457    
log(X)        0.82128    0.13486   6.090 3.53e-07 ***
log(L(X))     0.17484    0.13365   1.308    0.198    
log(L(X, 2))       NA         NA      NA       NA    
log(L(X, 3))       NA         NA      NA       NA    
log(L(X, 4))       NA         NA      NA       NA    
log(L(X, 5))       NA         NA      NA       NA 

I am puzzled by this behaviour as the gather operations (defining a new data frame with columns gathered into rows) has nothing do to with the data set I am using to run my regression (at least this was my impression). Somehow using gather() changes the way calculation is done, but I cannot see how. Help would be much appreciated!

Some number:

Update

Ok thank you for all the answers and comments so far, but the question remains: WHAT is going on in the environment? I want to know why and how this happens. To me this is something serious, since to my understanding avoiding non-intended side-effects of one function call on others is precicly what functional languages like R are trying to achieve. Now, unless I am missing something here, this behaviour seems to be at odds with that intention.

Upvotes: 3

Views: 448

Answers (2)

shadow
shadow

Reputation: 22313

The underlying reason for this unexpected change is that dplyr (dplyr, not tidyr) changes the default method of the lag function. The gather function calls dplyr::select_vars, which loads dplyr via namespace and overwrites lag.default.

The dynlm function internally calls lag when you use L in the formula. The method dispatch then finds lag.default. When dplyr is loaded via namespace (it does not even need to be attached), the lag.default from dplyr is found.

The two lag functions are fundamentally different. In a new R session, you will find the following difference:

lag(1:3, 1)
## [1] 1 2 3
## attr(,"tsp")
## [1] 0 2 1
invisible(dplyr::mutate) # side effect: loads dplyr via namespace...
lag(1:3, 1)
## [1] NA  1  2

So the solution is fairly simple. Just overwrite the lag.default function yourself.

lag.default <- stats:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)

## Time series regression with "ts" data:
##   Start = 1952, End = 1993
## 
## Call:
##   dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## 
## Coefficients:
##   (Intercept)        log(X)     log(L(X))  log(L(X, 2))  
## -0.05476       0.83870       0.01818       0.13928      

lag.default <- dplyr:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)

## Time series regression with "ts" data:
## Start = 1951, End = 1993
## 
## Call:
## dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## 
## Coefficients:
##  (Intercept)        log(X)     log(L(X))  log(L(X, 2))  
##     -0.05669       0.82128       0.17484            NA  

lag.default <- stats:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)

## Time series regression with "ts" data:
##   Start = 1952, End = 1993
## 
## Call:
##   dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## 
## Coefficients:
##   (Intercept)        log(X)     log(L(X))  log(L(X, 2))  
## -0.05476       0.83870       0.01818       0.13928    

Upvotes: 7

Elin
Elin

Reputation: 6755

When I ran your first block of code in R 3.1.3 I got the results you are showing as your second set of results with this:

(R Version 3.1.3, dynlm version .3-3).

Time series regression with "ts" data:
Start = 1951, End = 1993

Call:
dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)) + 
    log(L(X, 3)) + log(L(X, 4)) + log(L(X, 5)), data = data_ts)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.030753 -0.006364  0.001321  0.007939  0.025982 

Coefficients: (4 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.05669    0.07546  -0.751    0.457    
log(X)        0.82128    0.13486   6.090 3.53e-07 ***
log(L(X))     0.17484    0.13365   1.308    0.198    
log(L(X, 2))       NA         NA      NA       NA    
log(L(X, 3))       NA         NA      NA       NA    
log(L(X, 4))       NA         NA      NA       NA    
log(L(X, 5))       NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.01419 on 40 degrees of freedom
Multiple R-squared:  0.9974,    Adjusted R-squared:  0.9972 
F-statistic:  7578 on 2 and 40 DF,  p-value: < 2.2e-16

However when I updated to R 3.2.0 I got a repeat of what you got initially but then settled back into always getting the second results.

Now from your later comments you are also getting the second results consistently. So I think it must be that it is either that there was a typo in the code at some point or that there is something about the first time this runs in an empty environment.

Based on that hypothesis I closed RStudio completely, restarted and ran the first codeblock. In that case I got your initial result again.

So I think the answer to your question has to be there is something odd going on in the environment.

I read the documentation for dynlm and there are a few places where the defaults (if they are coming into play) would cause differences potentially. For example it will take the variables from the environment if no data is specified. It will use either a timeseries object or a dataframe. In your case you have both in the environment (data and data_ts). If you notice on the summary output that I have above it says Time series regression with "ts" data: which means it is running with a ts object. When I am getting the other result (without the NAs, your first result) it says Time series regression with "numeric" data: and in that case it is running with on data which is a dataframe or X and Y directly. So I think that must be the source of the difference. I'm not sure why exactly that would have happened with data_ts explicitly named though.

Upvotes: -1

Related Questions