Reputation: 149
I have a dataset that looks something like this:
data.table(x=c(11:30),y=rnorm(20))
I would like to calculate the rolling regression coefficient and rsquared over the last 10 items:
dtset[,coefficient:=rollapply(1:20,width=10,FUN=function(a) {
subdtset <- dtset[a]
reg <- lm.fit(matrix(data=c(subdtset$x, rep(1,nrow(subdtset))), nrow=nrow(subdtset), ncol=2), subdtset$y)
return(coef(reg)[1])
},align="right",fill=NA)]
dtset[,rsquare:=rollapply(1:20,width=10,FUN=function(a) {
subdtset <- dtset[a]
reg <- lm.fit(matrix(data=c(subdtset$x, rep(1,nrow(subdtset))), nrow=nrow(subdtset), ncol=2), subdtset$y)
return(1 - sum((subdtset$y - reg$fitted.values)^2) / sum((subdtset$y - mean(subdtset$y, na.rm=TRUE))^2))
},align="right",fill=NA)]
The code above accomplishes this, but my dataset has millions of rows and I have multiple columns where I want to make these calculations so it is taking a very long time. I am hoping there is a way to speed things up:
Thanks for the help!
Upvotes: 1
Views: 322
Reputation: 269481
Use by.column = FALSE
to pass both columns to the function. In the function calculate the slope and r squared directly to avoid the overhead of lm.fit
. Note that rollapply
can return a vector and that rollapplyr
with an r
on the end is right aligned. This also works if dtset
consists of a single x column followed by multiple y columns as in the example below with the builtin anscombe
data frame.
library(data.table)
library(zoo)
stats <- function(X, x = X[, 1], y = X[, -1]) {
c(slope = cov(x, y) / var(x), rsq = cor(x, y)^2)
}
rollapplyr(dtset, 10, stats, by.column = FALSE, fill = NA)
a <- anscombe[c("x3", "y1", "y2", "y3")]
rollapplyr(a, 3, stats, by.column = FALSE, fill = NA)
We check the formulas using the built-in BOD
data frame.
fm <- lm(demand ~ Time, BOD)
c(coef(fm)[[2]], summary(fm)$r.squared)
## [1] 1.7214286 0.6449202
stats(BOD)
## slope rsq
## 1.7214286 0.6449202
Upvotes: 1