Kevin

Reputation: 149

Using Rollapply to return both the Coefficient and RSquare

I have a dataset that looks something like this:

dtset <- data.table(x = 11:30, y = rnorm(20))

I would like to calculate the rolling regression coefficient (the slope) and R-squared over the last 10 rows:

dtset[,coefficient:=rollapply(1:20,width=10,FUN=function(a) {
  subdtset <- dtset[a]
  reg <- lm.fit(matrix(data=c(subdtset$x, rep(1,nrow(subdtset))), nrow=nrow(subdtset), ncol=2), subdtset$y)
  return(coef(reg)[1])
},align="right",fill=NA)]
dtset[,rsquare:=rollapply(1:20,width=10,FUN=function(a) {
  subdtset <- dtset[a]
  reg <- lm.fit(matrix(data=c(subdtset$x, rep(1,nrow(subdtset))), nrow=nrow(subdtset), ncol=2), subdtset$y)
  return(1 - sum((subdtset$y - reg$fitted.values)^2) / sum((subdtset$y - mean(subdtset$y, na.rm=TRUE))^2))
},align="right",fill=NA)]

The code above works, but my dataset has millions of rows and several columns that need these calculations, so it takes a very long time. I am hoping there is a way to speed things up:

  1. Is there a better way to capture the last 10 items in rollapply rather than passing the row numbers as the variable a and then doing subdtset <- dtset[a]? I tried using .SD and .SDcols but was unable to get that to work. I can only figure out how to get rollapply to accept one column or vector as the input, not two columns/vectors.
  2. Is there a way to return 2 values from one rollapply statement? I think I could get significant time savings if I only had to do the regression once, and then from that take the coefficient and calculate RSquare. It's pretty inefficient to do the same calculations twice.

Thanks for the help!

Upvotes: 1

Views: 322

Answers (1)

G. Grothendieck

Reputation: 269481

Use by.column = FALSE to pass both columns to the function. Within the function, calculate the slope and R-squared directly to avoid the overhead of lm.fit. Note that rollapply can return a vector, and that rollapplyr (with an r on the end) is right-aligned. This also works if dtset consists of a single x column followed by multiple y columns, as in the second example below, which uses the built-in anscombe data frame.

library(data.table)
library(zoo)

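# X is one window of rows: first column is x, remaining column(s) are y
stats <- function(X, x = X[, 1], y = X[, -1]) {
  # slope of y ~ x and squared correlation (R-squared of the simple regression)
  c(slope = cov(x, y) / var(x), rsq = cor(x, y)^2)
}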
rollapplyr(dtset, 10, stats, by.column = FALSE, fill = NA)
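To get both values back into dtset as columns from a single pass, something along the following lines should work. This is only a sketch: the set.seed call, the as.matrix conversion, and the column names coefficient and rsquare (taken from the question) are assumptions for illustration, and the result columns are picked by position rather than by name.

```r
library(data.table)
library(zoo)

set.seed(1)  # assumption: fixed seed so the example is reproducible
dtset <- data.table(x = 11:30, y = rnorm(20))

stats <- function(X, x = X[, 1], y = X[, -1]) {
  c(slope = cov(x, y) / var(x), rsq = cor(x, y)^2)
}

# One rolling pass; each window yields a length-2 vector, so the result
# is a 2-column matrix with NA rows for the first 9 incomplete windows.
res <- rollapplyr(as.matrix(dtset[, .(x, y)]), 10, stats,
                  by.column = FALSE, fill = NA)

# Assign by position: column 1 is the slope, column 2 is R-squared.
dtset[, c("coefficient", "rsquare") := list(res[, 1], res[, 2])]
```

The regression quantities are computed once per window, which addresses the second point in the question.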

a <- anscombe[c("x3", "y1", "y2", "y3")]
rollapplyr(a, 3, stats, by.column = FALSE, fill = NA)
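If rollapplyr is still too slow on millions of rows, the same two quantities can in principle be built from rolling means, which data.table computes in compiled code via frollmean. This is a different technique from the rollapplyr call above, not part of it, and is offered as a sketch only: the helper names w, m, sxy, sxx and syy are made up for illustration, and subtracting products of means can lose precision when x or y is large relative to its spread within a window.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed so the example is reproducible
dtset <- data.table(x = 11:30, y = rnorm(20))
w <- 10      # window width

x <- as.numeric(dtset$x)
y <- dtset$y

# Right-aligned rolling means of x, y, x*y, x^2 and y^2 (NA until a full window).
m <- frollmean(list(x, y, x * y, x^2, y^2), n = w)

sxy <- m[[3]] - m[[1]] * m[[2]]  # rolling covariance (population scale)
sxx <- m[[4]] - m[[1]]^2         # rolling variance of x
syy <- m[[5]] - m[[2]]^2         # rolling variance of y

# The 1/n versus 1/(n-1) scaling cancels in both ratios.
dtset[, `:=`(coefficient = sxy / sxx,
             rsquare     = sxy^2 / (sxx * syy))]
```

It is worth comparing a few rows against lm on the real data before relying on this, since accuracy depends on the scale of the inputs.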

Check

We check the formulas using the built-in BOD data frame.

fm <- lm(demand ~ Time, BOD)
c(coef(fm)[[2]], summary(fm)$r.squared)
## [1] 1.7214286 0.6449202

stats(BOD)
##     slope       rsq 
## 1.7214286 0.6449202 
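
The agreement is expected. For a simple regression of y on x with an intercept, the least-squares slope and R-squared reduce to the expressions used in stats. A short derivation, where the symbols S_xx, S_xy and S_yy are notation introduced here for the centered sums:

```latex
\begin{aligned}
S_{xy} &= \sum_i (x_i - \bar{x})(y_i - \bar{y}), \qquad
S_{xx} = \sum_i (x_i - \bar{x})^2, \qquad
S_{yy} = \sum_i (y_i - \bar{y})^2 \\[4pt]
\hat{\beta} &= \frac{S_{xy}}{S_{xx}}
  = \frac{S_{xy}/(n-1)}{S_{xx}/(n-1)}
  = \frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)} \\[4pt]
R^2 &= 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{S_{yy}}
  = 1 - \frac{S_{yy} - \hat{\beta}\, S_{xy}}{S_{yy}}
  = \frac{S_{xy}^2}{S_{xx}\, S_{yy}}
  = \operatorname{cor}(x, y)^2
\end{aligned}
```

This identity holds only for a single predictor with an intercept; with more predictors, R-squared is no longer a squared pairwise correlation.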

Upvotes: 1
