Rishi
Rishi

Reputation: 19

for loop speeding in R

I am trying to do the following operation in R for nrow=300,000 simulations (on ncol=30 variables):

down vote

accept

here's my code:

FS_DF <- read.csv("fs.csv", sep = ",")
Y_DF <- read.csv("Y.csv", sep = ",")
CALIBSCENS_DF <- read.csv("calib_scens.csv", sep = ",")
Y_DF$X <- NULL

X_mat <- matrix(1:1, nrow(CALIBSCENS_DF), nrow(FS_DF))

for (irow in 1:nrow(CALIBSCENS_DF)) { 
for (jrow in 1:nrow(FS_DF)) { 
for (krow in 1:ncol(FS_DF)) { 
    X_mat [irow, jrow] <- X_mat[irow, jrow] * (CALIBSCENS_DF[irow, krow] ^ FS_DF[jrow, krow])

}}}

fit <- .lm.fit(X_mat, as.matrix(sapply(Y_DF, as.numeric)))

Its taking forever to fill my X matrix. Can someone suggest a faster approach to do this operation. SCENS_DF, FS_DF are data frames. X_mat is a matrix.

Upvotes: 1

Views: 110

Answers (3)

David
David

Reputation: 10232

If this code is your bottleneck and you use loops, thats always a good sign that cpp might yield good results. We can use Rcpp to make it easier and have the cpp-function within our code.

Below you find my approach using Rcpp and some benchmarks against minem's approach, shaving off roughly 20% of runtime (highly depending on the sizes of the matrices).

library(Rcpp) # load the Rcpp library

# create some data...
CALIBSCENS_DF <- matrix(2:5, nrow = 2)
FS_DF <- matrix(2:5, nrow = 2)

# create the cpp-function, basically the same as yours, just adapted to cpp
cppFunction("
NumericMatrix cpp_fun(NumericMatrix A, NumericMatrix B) {
    NumericMatrix retMax(A.nrow(), B.nrow());

    long double mult;
    for (int irow = 0; irow < A.nrow(); irow++) {
        for (int jrow = 0; jrow < B.nrow(); jrow++) {
            mult = 1;
            for (int krow = 0; krow < B.ncol(); krow++) {
                mult *= pow(A(irow, krow), B(jrow, krow));
            }
            retMax(irow, jrow) = mult;
        }
    }
    return retMax;
}
")
# execute the function called 'cpp_fun' in R
cpp_mat <- cpp_fun(CALIBSCENS_DF, FS_DF)
cpp_mat
# [,1]  [,2]
# [1,] 1024  8192
# [2,] 5625 84375

Compare the function to the result shown by Minem

# for comparison, use Minems function
minem_fun <- function(A_mat, B_mat) {
  X <- matrix(1, ncol = nrow(B_mat), nrow = nrow(A_mat))
  for (irow in 1:nrow(A_mat)) {
    for (jrow in 1:nrow(B_mat)) {
      X [irow, jrow] <- prod(A_mat[irow, ] ^ B_mat[jrow, ])
    }
  }
  return(X)
}
minem_mat <- minem_fun(CALIBSCENS_DF, FS_DF)

identical(cpp_mat, minem_mat)
# [1] TRUE

Speed-benchmark

library(microbenchmark)
# small data
microbenchmark(
  minem = minem_fun(CALIBSCENS_DF, FS_DF),
  cpp = cpp_fun(CALIBSCENS_DF, FS_DF),
  times = 1000
)
# Unit: microseconds
# expr   min     lq      mean median     uq      max neval
# minem 9.386 10.239 11.198179  10.24 10.667   49.915  1000
# cpp 1.707  2.560  3.954538   2.56  2.987 1098.980  1000


# larger data
n <- 200
CALIB_large <- matrix(rnorm(n^2, mean = 100, sd = 10), nrow = n, ncol = n)
FS_large <- matrix(rnorm(n^2, mean = 2, sd = 0.5), nrow = n, ncol = n)

microbenchmark(
  minem = minem_fun(CALIB_large, FS_large),
  cpp = cpp_fun(CALIB_large, FS_large),
  times = 10
)
# Unit: seconds
# expr      min       lq     mean   median       uq      max neval
# minem 1.192011 1.197783 1.209692 1.201320 1.230812 1.238446    10
# cpp 1.009908 1.019727 1.023600 1.025791 1.028152 1.029427    10

Does that help you out?

Upvotes: 2

minem
minem

Reputation: 3650

It looks like we can remove one loop this way:

CALIBSCENS_DF <- matrix(2:5, nrow = 2)
FS_DF <- matrix(2:5, nrow = 2)
X <- matrix(1, ncol = nrow(FS_DF), nrow = nrow(CALIBSCENS_DF))
for (irow in 1:nrow(CALIBSCENS_DF)) { 
  for (jrow in 1:nrow(FS_DF)) { 
      X [irow, jrow] <-
        X[irow, jrow] * prod(CALIBSCENS_DF[irow, ] ^ FS_DF[jrow, ])
    }}
X
#      [,1]  [,2]
# [1,] 1024  8192
# [2,] 5625 84375

Upvotes: 1

Benjamin
Benjamin

Reputation: 17279

This isn't really an answer to your question yet, but won't fit in a comment. I think we need to take a good look at what you're attempting to do and decide if the for loop is doing what you think it is.

Let's simplify the code a little. Let's have matrices X, C, and F and define the loop

for (i in 1:nrow(C)){
  for (j in 1:nrow(F)){
    for (k in 1:ncol(F)){
      X[i, j] <- X[i, j] * C[i, k] ^ F[j, k]
    }
  }
}

Now let's step through what will happen as the loop iterates

i = 1; j = 1; k = 1      X[1, 1] <- X[1, 1] * C[1, 1] ^ F[1, 1]
i = 1; j = 1; k = 2      X[1, 1] <- X[1, 1] * C[1, 2] ^ F[1, 2]
i = 1; j = 1; k = 3      X[1, 1] <- X[1, 1] * C[1, 3] ^ F[1, 3]
...
i = 1; j = 1; k = 30      X[1, 1] <- X[1, 1] * C[1, 30] ^ F[1, 30]

Ultimately, X[1, 1] is dependent on C[1, 30] and F[1, 30]. You've done 29 iterations that have been overwritten. At this point, the loop will increment j and you'll get

i = 1; j = 2; k = 1      X[1, 2] <- X[1, 2] * C[1, 1] ^ F[2, 1]
i = 1; j = 2; k = 2      X[1, 2] <- X[1, 2] * C[1, 2] ^ F[2, 2]
i = 1; j = 2; k = 3      X[1, 2] <- X[1, 2] * C[1, 3] ^ F[2, 3]
...
i = 1; j = 2; k = 30      X[1, 2] <- X[1, 2] * C[1, 30] ^ F[2, 30]

So X[1, 2] is dependent on C[1, 30] and F[2, 30].

Is this the behavior you are expecting?

Upvotes: 0

Related Questions