Hugh Perkins

Reputation: 8602

Fast way of evaluating a formula?

I'm using either dyn or dynlm to predict time series using lagged variables.

However, the predict function in either case evaluates only one time step at a time, taking a roughly constant 24 milliseconds per step on my computer, or about 1.8 hours for my dataset, which is super long given that the entire regression takes about 10 seconds.

So, I'm thinking that perhaps the fastest thing might be just to evaluate the formula by hand?

So, is there some way of evaluating a formula given values in a data.frame or the current environment or similar?

I'm thinking of something along the lines of:

evalMagic( load ~ temperature + time, data.frame( temperature = 10, time = 4 ) )

I suppose, as I write this, that we need to handle the coefficients somehow, something like:

evalMagic( load ~ temperature + time, data.frame( temperature = 10, time = 4 ), model$coefficients )

.... which raises some further questions.

Upvotes: 2

Views: 430

Answers (2)

Hugh Perkins

Reputation: 8602

I wrote my own lag implementation in the end. It's hacky and not beautiful, but it's a lot faster. It can process 1000 rows in 4 seconds on my crappy laptop.

# lags is a data.frame, eg:
#   var  amount
#   y    1
#   y    2
addLags <- function( dataset, lags ) {
    # for each (var, amount) row, add a column named var_amount containing
    # var shifted down by `amount` rows, padded at the top with zeros
    N <- nrow(dataset)
    if( nrow(lags) > 0 ) {
        for( j in 1:nrow(lags) ) {
            sourcename <- as.character( lags[j,"var"] )
            k <- lags[j,"amount"]
            lagcolname <- sprintf("%s_%d",sourcename,k)
            dataset[,lagcolname] <- c(rep(0,k), dataset[1:(N-k),sourcename])
        }
    }
    dataset
}
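The padded shift at the heart of addLags can be illustrated in isolation; a minimal self-contained sketch (the name lagColumn is mine, not from the code above):

```r
# shift a vector down by k positions, padding the top with zeros,
# mirroring what addLags does for each (var, amount) row
lagColumn <- function(x, k) c(rep(0, k), x[1:(length(x) - k)])

y <- c(10, 20, 30, 40)
lagColumn(y, 1)  # 0 10 20 30
lagColumn(y, 2)  # 0  0 10 20
```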

lmLagged <- function( formula, train, lags ) {
    # skip the first `skip` rows, whose lagged columns are just zero padding
    N <- nrow(train)
    skip <- max( lags$amount )
    train <- addLags( train, lags )
    lm( formula, train[(skip+1):N,] )
}

# pass in training data, test data,
# it will step through one by one
# need to give dependent var name
# lags is a data.frame, eg:
#   var amount
#   y    1
#   y    2
predictLagged <- function( model, train, test, dependentvarname, lags ) {
    Ntrain <- nrow(train)
    Ntest <- nrow(test)
    test[,dependentvarname] <- NA
    testtraindata <- rbind( train, test )
    testtraindata <- addLags( testtraindata, lags )
    for( i in 1:Ntest ) {
       thistestdata <- testtraindata[Ntrain + i,]
       result <- predict(model,newdata=thistestdata)
       # feed the prediction forward into the lagged columns it affects
       for( j in 1:nrow(lags) ) {
            sourcename <- as.character( lags[j,"var"] )
            k <- lags[j,"amount"]
            lagcolname <- sprintf("%s_%d",sourcename,k)
            if( Ntrain + i + k <= Ntrain + Ntest ) {
                testtraindata[Ntrain + i + k,lagcolname] <- result
            }
       }
       testtraindata[Ntrain+i,dependentvarname] <- result
    }
    testtraindata[(Ntrain+1):(Ntrain + Ntest),dependentvarname]
}

library("RUnit")

# size of training data
N <- 6
predictN <- 50

# create training data, which we can get exact fit on
set.seed(1)
x = sample( 100, N )
traindata <- numeric()
traindata[1] <- 1 + 1.1 * x[1]
traindata[2] <- 2 + 1.1 * x[2]
for( i in 3:N ) {
   traindata[i] <- 0.5 + 0.3 * traindata[i-2] - 0.8 * traindata[i-1] + 1.1 * x[i]
}
train <- data.frame(x = x, y = traindata, foo = 1)
#train$x <- NULL

# create testing data, bunch of NAs
test <- data.frame( x = sample(100,predictN), y = rep(NA,predictN), foo = 1)

# specify which lags we need to handle
# one row per lag, with name of variable we are lagging, and the distance
# we can then use these in the formula, eg y_1, and y_2
# are y lagged by 1 and 2 respectively
# It's hacky but it kind of works...
lags <- data.frame( var = c("y","y"), amount = c(1,2) ) 

# fit a model
model <- lmLagged(  y ~ x + y_1 + y_2, train, lags )
# look at the model, it's a perfect fit. Nice!
print(model)

print(system.time( test <- predictLagged( model, train, test, "y", lags ) ))
#checkEqualsNumeric( 69.10228, test[56-6], tolerance = 0.0001 )
#checkEquals( 2972.159, test$y[106-6] )
print(test)

# nice plot
plot(test, type='l')

Output:

> source("test/test.regressionlagged.r",echo=F)

Call:
lm(formula = formula, data = train[(skip + 1):N, ])

Coefficients:
(Intercept)            x          y_1          y_2  
        0.5          1.1         -0.8          0.3  

   user  system elapsed 
  0.204   0.000   0.204 
 [1]  -19.108620  131.494916  -42.228519   80.331290  -54.433588   86.846257
 [7]  -13.807082   77.199543   12.698241   64.101270   56.428457   72.487616
[13]   -3.161555   99.575529    8.991110   44.079771   28.433517    3.077118
[19]   30.768361   12.008447    2.323751   36.343533   67.822299  -13.154779
[25]   72.070513  -11.602844  115.003429  -79.583596  164.667906 -102.309403
[31]  193.347894 -176.071136  254.361277 -225.010363  349.216673 -299.076448
[37]  400.626160 -371.223862  453.966938 -420.140709  560.802649 -542.284332
[43]  701.568260 -679.439907  839.222404 -773.509895  897.474637 -935.232679
[49] 1022.328534 -991.232631

There's about 12 hours' work in those 91 lines of code. Ok, I confess I played Plants and Zombies for a bit. So, 10 hours. Plus lunch and dinner. Still, quite a lot of work anyway.

If we change predictN to 1000, I get about 4.1 seconds from the system.time call.

I think it's faster because:

  • we don't use time-series classes, which I suspect speeds things up
  • we use plain lm rather than the dynamic lm libraries, which I guess is slightly faster
  • we pass only a single row of data into predict for each prediction, which I think is significantly faster; with dyn$lm or dynlm, a lag of 30 would require passing 31 rows of data into predict, AFAIK
  • there's a lot less data.frame/matrix copying, since we just update the lag values in place on each iteration
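On the single-row point: with a plain lm fit, row-by-row and bulk predictions agree, and the recursive setup above forces the row-by-row form anyway, since each row's lagged regressors come from the previous prediction. A small self-contained sketch (data invented for illustration):

```r
# row-by-row predict() matches one bulk predict() call on the same rows;
# recursive forecasting just pays the per-call overhead N times
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- 1 + 2 * d$x + rnorm(200)
fit <- lm(y ~ x, d)

one_by_one <- vapply(seq_len(nrow(d)),
                     function(i) predict(fit, newdata = d[i, ]),
                     numeric(1))
bulk <- predict(fit, newdata = d)
all.equal(unname(one_by_one), unname(bulk))  # TRUE: same predictions
```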

Edit: corrected minor buggette where predictLagged returned a multi-column data.frame instead of just a numeric vector.

Edit2: corrected less minor bug where you couldn't add more than one variable. Also reconciled the comments and code for lags, changed the lags structure to "var" and "amount" in place of "name" and "lags", and updated the test code to add a second variable.

Edit: there are tons of bugs in this version, which I know, because I've unit-tested it a bit more and fixed them, but copying and pasting is very time-consuming, so I will update this post in a few days, once my deadline is over.

Upvotes: 2

Ben Bolker

Reputation: 226332

Maybe you're looking for this:

fastlinpred <- function(formula, newdata, coefs) {
    # build the design matrix for the RHS-only formula,
    # then apply the coefficients directly
    X <- model.matrix( formula, data=newdata )
    X %*% coefs
}
coefs <- c(1,2,3)
dd <- data.frame( temperature = 10, time = 4 )
fastlinpred( ~ temperature + time, dd, coefs )

This assumes that the formula has only an RHS (you can get rid of the LHS of a formula by doing form[-2]).
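For example, the form[-2] idiom in action (a quick sketch):

```r
form <- y ~ temperature + time
form[-2]  # drops the LHS, leaving ~temperature + time
```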

This certainly gets rid of a lot of the overhead of predict.lm, but I don't know if it is as fast as you want. model.matrix has a lot of internal machinery too.
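If the model.matrix overhead still matters, and the formula is purely additive in numeric variables (no factors, interactions, or transformations), the linear predictor reduces to an intercept plus a dot product. A hand-rolled sketch (fastpred and its argument names are my own, not part of predict.lm):

```r
# manual linear predictor: intercept column plus the named numeric columns,
# multiplied by the coefficient vector (intercept first)
fastpred <- function(coefs, newdata, vars) {
  as.vector(cbind(1, as.matrix(newdata[, vars])) %*% coefs)
}

coefs <- c(1, 2, 3)
dd <- data.frame(temperature = 10, time = 4)
fastpred(coefs, dd, c("temperature", "time"))  # 1 + 2*10 + 3*4 = 33
```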

Upvotes: 1

Related Questions