Rasmus

Reputation: 43

Replacing a for loop in R to speed up the code

I'm using R to investigate how the return affects a person's pension account. To do this, I'm calculating the development of the pension account from age 25 until retirement at age 70 for 1000 different return scenarios. I'm using the variables expenses (e), monthly deposit (m), return in percent (r), account balance (y) and return in euros (x). They are all stored in data frames with dimensions 46x1000.

I have successfully managed to calculate it using a for loop. However, this is very slow, and since I'm doing a lot of these, I am wondering if someone has an idea for speeding up the code. I have tried apply functions and vectorization but cannot get them to work. My problem is that I have to calculate the numbers for year i before calculating the numbers for year i+1. I have searched the internet for a solution, but have a hard time finding answers that apply to my specific problem. I should note that I'm still pretty new to R.

I have written a simplified version of the code I'm using:

for (i in 3:46) {
x[i-1,]<-(y[i-1,]+m[i-1,]*6-0.5*e[i-1,])*r[i-1,]
y[i,]<-y[i-1,]+x[i-1,]-e[i-1,]+m[i-1,]*12 
}

I hope someone is able to help, and thanks in advance.

Best regards Rasmus

Upvotes: 1

Views: 123

Answers (1)

rosscova

Reputation: 5590

Your process looks to me like it needs the loop, since each iteration depends on the one before it. As @Gregor de Cillia mentions in the comments, you could do this in C++ for a speed improvement.
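To see why the loop over years cannot simply be vectorized away, it may help to look at a single return scenario in isolation. The following is a minimal standalone sketch in plain C++ (std::vector, no Rcpp). The names PensionPath and simulate_scenario are hypothetical and used only for illustration; the two formulas are copied from the loop in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Result for one return scenario (hypothetical type, for illustration only).
struct PensionPath {
    std::vector<double> x;  // return in euros earned during each year
    std::vector<double> y;  // account balance at the start of each year
};

// Runs the question's recurrence for a single scenario.
// e, m and r hold one value per year; y0 is the starting balance.
PensionPath simulate_scenario(const std::vector<double>& e,
                              const std::vector<double>& m,
                              const std::vector<double>& r,
                              double y0) {
    assert(!e.empty() && e.size() == m.size() && e.size() == r.size());
    const std::size_t n = e.size();

    PensionPath p;
    p.x.assign(n - 1, 0.0);
    p.y.assign(n, 0.0);
    p.y[0] = y0;

    for (std::size_t i = 1; i < n; ++i) {
        // x for year i-1 needs y for year i-1 ...
        p.x[i - 1] = (p.y[i - 1] + m[i - 1] * 6 - 0.5 * e[i - 1]) * r[i - 1];
        // ... and y for year i needs both, so the years must run in order.
        p.y[i] = p.y[i - 1] + p.x[i - 1] - e[i - 1] + m[i - 1] * 12;
    }
    return p;
}
```

The scenarios themselves do not depend on each other, so each column could be passed through a function like this independently. Only the steps along the years have to happen in sequence.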

First, set up some data.

set.seed(1)
e <- matrix( data = rnorm( n = 46000, mean = 1000, sd = 200 ),
             nrow = 46,
             ncol = 1000 )
m <- matrix( data = rnorm( n = 46000, mean = 2000, sd = 200 ),
             nrow = 46,
             ncol = 1000 )
r <- matrix( data = rnorm( n = 46000, mean = 4, sd = 0.5 ),
             nrow = 46,
             ncol = 1000 )
x <- matrix( data = NA_real_, nrow = 45, ncol = 1000 )
y <- matrix( data = NA_real_, nrow = 46, ncol = 1000 )
y[1,] <- rnorm( n = 1000, 10000, 1000 )

Then define a C++ function in an Rcpp file and compile it with Rcpp::sourceCpp(). It returns a list with your two matrices x and y as list items:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List pension( NumericMatrix e,
              NumericMatrix m,
              NumericMatrix r,
              NumericVector yfirstrow ) {

    int ncols = e.cols();
    int nrows = e.rows();

    NumericMatrix x( nrows - 1, ncols );
    NumericMatrix y( nrows, ncols );

    y( 0, _ ) = yfirstrow;

    for( int i = 1; i < nrows; i++ ) {
        x( i-1, _ ) = ( y( i-1, _ ) + m( i-1, _ ) * 6 - 0.5 * e( i-1, _ ) ) * r( i-1, _ );
        y( i, _ ) = y( i-1, _ ) + x( i-1, _ ) - e( i-1, _ ) + m( i-1, _ )* 12;
    }

    List ret;
    ret["x"] = x;
    ret["y"] = y;

    return ret;

}

Compare the two methods for speed.

microbenchmark::microbenchmark(
    R = {
        for (i in 2:46) {
            x[i-1,] <- unlist( (y[i-1,] + m[i-1,]*6 - 0.5*e[i-1,] ) * r[i-1,] )
            y[i,]<- unlist( y[i-1,]+x[i-1,]-e[i-1,]+m[i-1,]*12 )
        }
    },
    cpp = {
        cppList <- pension( e, m, r, y[1,] )
    },
    times = 100
)

Make sure the outputs match:

> identical( x, cppList$x )
[1] TRUE
> identical( y, cppList$y )
[1] TRUE

The speed test results:

Unit: microseconds
 expr      min       lq     mean   median       uq       max neval
    R 3309.962 3986.569 6961.838 5244.479 6219.215 96576.592   100
  cpp  879.713  992.229 1266.014 1124.345 1273.691  3041.966   100

So the Rcpp solution is around 5x faster here (median of about 1.1 ms versus 5.2 ms), but to be honest, the R loop you've made isn't too shabby for the dataset you're working with: with only 45 iterations, the overhead of the R loop isn't much of a hindrance. If you really need the speed, C++ can help.

Upvotes: 4
