Reputation: 8602
I'm using either dyn or dynlm to predict time series using lagged variables.
However, the predict function in either case only evaluates one time step at a time, taking a constant time of 24 milliseconds per step on my computer, or about 1.8 hours for my dataset, which is super long, given that the entire regression takes about 10 seconds.
So, I'm thinking that perhaps the fastest thing might be just to evaluate the formula by hand?
So, is there some way of evaluating a formula given values in a data.frame or the current envrironment or similar?
I'm thinking of something along the lines of:
evalMagic( load ~ temperature + time, data.frame( temperature = 10, time = 4 ) )
I suppose, as I write this, that we need to handle the coefficients somehow, something like:
evalMagic( load ~ temperature + time, data.frame( temperature = 10, time = 4 ), model$coefficients )
.... so this raises the questions of:
Upvotes: 2
Views: 430
Reputation: 8602
I wrote my own lag implementation in the end. It's hacky and not beautiful, but it's a lot faster. It can process 1000 rows in 4 seconds on my crappy laptop.
# lags is a data.frame, eg:
# var amount
# y 1
# y 2
addLags <- function( dataset, lags ) {
N <- nrow(dataset)
print(lags)
if( nrow(lags) > 0 ) {
print(lags)
for( j in 1:nrow(lags) ) {
sourcename <- as.character( lags[j,"var"] )
k <- lags[j,"amount"]
cat("k",k,"sourcename",sourcename,"\n")
lagcolname <- sprintf("%s_%d",sourcename,k)
dataset[,lagcolname] <- c(rep(0,k), dataset[1:(N-k),sourcename])
}
}
dataset
}
lmLagged <- function( formula, train, lags ) {
# get largest lag, and skip that
N <- nrow(train)
skip <- 0
for( j in 1:nrow(lags) ) {
k <- lags[j,"amount"]
skip <- max(k,skip)
}
print(train)
train <- addLags( train, lags )
print(train)
lm( formula, train[(skip+1):N,] )
}
# pass in training data, test data,
# it will step through one by one
# need to give dependent var name
# lags is a data.frame, eg:
# var amount
# y 1
# y 2
predictLagged <- function( model, train, test, dependentvarname, lags ) {
Ntrain <- nrow(train)
Ntest <- nrow(test)
test[,dependentvarname] <- NA
testtraindata <- rbind( train, test )
testtraindata <- addLags( testtraindata, lags )
for( i in 1:Ntest ) {
thistestdata <- testtraindata[Ntrain + i,]
result <- predict(model,newdata=thistestdata)
for( j in 1:nrow(lags) ) {
sourcename <- lags[j,"var"]
k <- lags[j,"amount"]
lagcolname <- sprintf("%s_%d",sourcename,k)
testtraindata[Ntrain + i + k,lagcolname] <- result
}
testtraindata[Ntrain+i,dependentvarname] <- result
}
return( testtraindata[(Ntrain+1):(Ntrain + Ntest),dependentvarname] )
}
library("RUnit")
# size of training data
N <- 6
predictN <- 50
# create training data, which we can get exact fit on
set.seed(1)
x = sample( 100, N )
traindata <- numeric()
traindata[1] <- 1 + 1.1 * x[1]
traindata[2] <- 2 + 1.1 * x[2]
for( i in 3:N ) {
traindata[i] <- 0.5 + 0.3 * traindata[i-2] - 0.8 * traindata[i-1] + 1.1 * x[i]
}
train <- data.frame(x = x, y = traindata, foo = 1)
#train$x <- NULL
# create testing data, bunch of NAs
test <- data.frame( x = sample(100,predictN), y = rep(NA,predictN), foo = 1)
# specify which lags we need to handle
# one row per lag, with name of variable we are lagging, and the distance
# we can then use these in the formula, eg y_1, and y_2
# are y lagged by 1 and 2 respectively
# It's hacky but it kind of works...
lags <- data.frame( var = c("y","y"), amount = c(1,2) )
# fit a model
model <- lmLagged( y ~ x + y_1 + y_2, train, lags )
# look at the model, it's a perfect fit. Nice!
print(model)
print(system.time( test <- predictLagged( model, train, test, "y", lags ) ))
#checkEqualsNumeric( 69.10228, test[56-6], tolerance = 0.0001 )
#checkEquals( 2972.159, test$y[106-6] )
print(test)
# nice plot
plot(test, type='l')
Output:
> source("test/test.regressionlagged.r",echo=F)
Call:
lm(formula = formula, data = train[(skip + 1):N, ])
Coefficients:
(Intercept) x y_1 y_2
0.5 1.1 -0.8 0.3
user system elapsed
0.204 0.000 0.204
[1] -19.108620 131.494916 -42.228519 80.331290 -54.433588 86.846257
[7] -13.807082 77.199543 12.698241 64.101270 56.428457 72.487616
[13] -3.161555 99.575529 8.991110 44.079771 28.433517 3.077118
[19] 30.768361 12.008447 2.323751 36.343533 67.822299 -13.154779
[25] 72.070513 -11.602844 115.003429 -79.583596 164.667906 -102.309403
[31] 193.347894 -176.071136 254.361277 -225.010363 349.216673 -299.076448
[37] 400.626160 -371.223862 453.966938 -420.140709 560.802649 -542.284332
[43] 701.568260 -679.439907 839.222404 -773.509895 897.474637 -935.232679
[49] 1022.328534 -991.232631
There's about 12 hours work in those 91 lines of code. Ok, I confess I played Plants and Zombies for a bit. So, 10 hours. Plus lunch and dinner. Still, quite a lot of work anyway.
If we change predictN to 1000, I get about 4.1 seconds from the system.time
call.
I think it's faster because:
Edit: corrected minor buggette where predictLagged returned a multi-column data-frame instead of just a numeric vector Edit2: corrected less minor bug where you couldn't add more than one variable. Also reconciled the comments and code for lags, and changed the lags structure to "var" and "amount" in place of "name" and "lags". Also, updated the test code to add a second variable.
Edit: there are tons of bugs in this version, which I know, because I've unit-tested it a bit more and fixed them, but copying and pasting is very time-consuming, so I will update this post in a few days, once my deadline is over.
Upvotes: 2
Reputation: 226332
Maybe you're looking for this:
fastlinpred <- function(formula, newdata, coefs) {
X <- model.matrix( formula, data=newdata)
X %*% coefs
}
coefs <- c(1,2,3)
dd <- data.frame( temperature = 10, time = 4 )
fastlinpred( ~ temperature + time,
dd , coefs )
This assumes that the formula has only a RHS (you can get rid of the LHS of a formula by doing form[-2]
).
This certainly gets rid of a lot of the overhead of predict.lm
, but I don't know if it is as fast as you want. model.matrix
has a lot of internal machinery too.
Upvotes: 1