Snowflake

Reputation: 3081

What optimisation approaches are there for the current for-loop in R?

I have the following code that I would like to optimise, but I am currently not sure how to go about it. First, let me give you an introduction to the problem.

test.data contains approximately 200,000 rows, which makes the implementation below extremely slow in R. The first thing I tried was to optimise the functions and remove as much testing (if statements) as possible, but there are two instances in the code below where I am unable to do so.

library(data.table)
test.data <- data.table(person = c("A", "B", "C"),
                        duration = c(120,50,30),
                        time = c(159, 231, 312),
                        savings = c(140000, 200000, 300000),
                        ren = c(0.0037, 0.0011, 0.0015),
                        res = c(55, 10, 30))

set.seed(35)

# Deduction series; note that in this example the values are arbitrary.
# They do not follow a pattern, which I believe is the core of the problem:
# it makes the code extremely difficult to vectorise, since there is no
# closed-form solution.
c_a <- round(runif(max(test.data$duration)), 2) / 10

# Put in as a constant, but it can vary arbitrarily.
c_b <- rep(round((8.5 / 12)/100, digits = 4), max(test.data$duration))
rnew <- 0.25
result <- matrix(0, nrow = 6, ncol = 120)

for(j in 1:nrow(test.data)){
  savings <- test.data$savings[j]
  duration <- test.data$duration[j]
  time <- test.data$time[j]
  res <- test.data$res[j]
  m <- matrix(nrow = 6, ncol = duration)

  for(i in 1:duration){
    m[1,i] <- ifelse(i == 1, savings, m[6, i-1])

    m[2,i] <- -m[1,i] * c_a[i]

    m[3,i] <- -(m[1,i] + m[2,i]) * c_b[i]

    m[4,i] <- ifelse(i == duration, -(m[1,i] + m[2,i] + m[3,i]), -(m[1,i] + m[2,i]) / (time + 1 - i))

    if(i == res & res < time){
      m[5, i] <- -(m[1,i] + m[2,i]) * (1 - rnew)
    } else {
      m[5, i] <- 0
    }

    m[6, i] <- m[1,i] + m[2,i] + m[3,i] + m[4,i] + m[5,i]
  }

  m <- cbind(m, matrix(0, ncol = ncol(result) - ncol(m), nrow = nrow(result)))

  result <- matrix(mapply(sum, result, m, MoreArgs = list(na.rm = TRUE)), ncol = ncol(result))
}

Second, I have tried to vectorise the code, but I don't believe this is doable, since c_a and c_b are random values and I therefore can't simply raise things to a certain power. I believe that in order to vectorise the code I would need a closed-form expression, but I am unable to derive one.

The third problem I came across is memory: if I store all intermediate results, everything blows up to 3 * 120 * 6 values, which is quite a significant memory increase in my opinion, so I am effectively forced to do this "one at a time".

In addition, I have tried %dopar%, but unfortunately memory constraints do not allow me to use more than 2 cores (16 GB of memory).
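
A minimal sketch of what that can look like (not my actual code: it assumes the foreach and doParallel packages, the body of the outer loop is moved essentially unchanged into the %dopar% expression, and .combine = "+" sums the per-person matrices as they arrive so they never all sit in memory at once):

library(foreach)
library(doParallel)

cl <- makeCluster(2)   # limited to 2 workers because of memory
registerDoParallel(cl)

# each iteration returns one padded 6 x 120 matrix; .combine = "+" adds them up
result <- foreach(j = seq_len(nrow(test.data)), .combine = "+") %dopar% {
  savings  <- test.data$savings[j]
  duration <- test.data$duration[j]
  time     <- test.data$time[j]
  res      <- test.data$res[j]

  m <- matrix(0, nrow = 6, ncol = 120)
  for (i in 1:duration) {
    m[1, i] <- if (i == 1) savings else m[6, i - 1]
    m[2, i] <- -m[1, i] * c_a[i]
    m[3, i] <- -(m[1, i] + m[2, i]) * c_b[i]
    m[4, i] <- if (i == duration) -(m[1, i] + m[2, i] + m[3, i]) else -(m[1, i] + m[2, i]) / (time + 1 - i)
    m[5, i] <- if (i == res && res < time) -(m[1, i] + m[2, i]) * (1 - rnew) else 0
    m[6, i] <- sum(m[1:5, i])
  }
  m
}

stopCluster(cl)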

Now I am wondering what further optimisation techniques exist without going as deep as Rcpp.

Upvotes: 0

Views: 63

Answers (2)

chinsoon12

Reputation: 25225

A possible approach to calculate the sum of the outstanding amount (i.e. the OP's row 1 in result). All the intermediate values (m[2,i], m[3,i], m[4,i], m[5,i]) can be calculated easily if required. Caveat: I did not time it with the actual dimensions.

library(data.table)

calcAmor <- function(ca, cb, rnew, dur, S0, tau, res) {
    # one amortisation step: from balance S, subtract the ca deduction, the cb
    # deduction on the remainder, and the (S - ca*S) / (tau + 1 - i) term;
    # ca.t is the pair c(ca[i], i)
    amortize <- function(S, ca.t) S - ca.t[1L]*S - (1-ca.t[1L])*cb*S - (S - ca.t[1L]*S) / (tau + 1 - ca.t[2L])

    # roll the balance forward over periods 1..dur (row 1 of the OP's m);
    # split() turns cbind(ca, index) into a list of c(ca[i], i) pairs
    ans <- Reduce(amortize,
        split(cbind(ca, seq_along(ca)), seq_along(ca)),
        init=S0,
        accumulate=TRUE)[-(dur+1L)]

    # redo the periods after res, restarting from a balance that also carries
    # the extra (1 - rnew) deduction applied in period res
    ix <- min(res+1L, dur):dur
    tmp <- Reduce(amortize,
        split(cbind(ca[ix], ix), seq_along(ix)),
        init=amortize(ans[res], c(ca[res], res)) - (ans[res] - ans[res]*ca[res])*(1-rnew),
        accumulate=TRUE)
    ans[ix] <- tmp[-length(tmp)]

    ans
}

set.seed(35)
test.data <- data.table(person = c("A", "B", "C"),
    duration = c(120,50,30),
    time = c(159, 231, 312),
    savings = c(140000, 200000, 300000),
    res = c(55, 10, 30))
maxd <- test.data[, max(duration)]
c_a <- round(runif(maxd), 2) / 10
rnew <- 0.25
cb <- round((8.5 / 12)/100, digits = 4)

test.data[, .(
        dur=seq_len(duration),
        S=calcAmor(ca=c_a[seq_len(duration)], cb, rnew, dur=duration, S0=savings, tau=time, res=res)),
    by=.(person)][, sum(S), by=.(dur)]

output:

     dur           V1
  1:   1 6.400000e+05
  2:   2 5.783318e+05
  3:   3 5.711966e+05
  4:   4 5.336450e+05
  5:   5 4.774502e+05
 ---                 
116: 116 7.075169e+00
117: 117 6.788631e+00
118: 118 6.339002e+00
119: 119 5.639335e+00
120: 120 5.297898e+00
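
The same row-1 series can also be written without Reduce(): within one person, every period multiplies the balance by (1 - ca[i]) * (1 - cb - 1/(tau + 1 - i)), with an extra -(1 - ca[res]) * (1 - rnew) term in period res, so the whole series is a cumprod(). A rough, untested sketch (vec_balance is only an illustrative name; it follows the recursion in the question, so edge cases such as res == duration may differ slightly from calcAmor):

vec_balance <- function(ca, cb, rnew, dur, S0, tau, res) {
  i <- seq_len(dur)
  # per-period factor applied to the balance
  f <- (1 - ca[i]) * (1 - cb - 1 / (tau + 1 - i))
  # extra deduction in period res (same condition as the question's loop)
  if (res < tau && res <= dur)
    f[res] <- f[res] - (1 - ca[res]) * (1 - rnew)
  # balance at the start of periods 1..dur
  S0 * cumprod(c(1, f))[i]
}

For person A, for example, vec_balance(c_a[1:120], cb, rnew, dur = 120, S0 = 140000, tau = 159, res = 55) should agree with calcAmor() called with the same arguments.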

Upvotes: 1

Alexis

Reputation: 5059

The only thing I can suggest is to initialize m only once with the same dimensions as result, and replace the last 2 lines of the outer loop as shown below. This will avoid re-allocating* m and the element-wise sum done with mapply.

result <- matrix(0, nrow = 6, ncol = 120)
m <- result

for (j in 1:nrow(test.data)) {
  savings <- test.data$savings[j]
  duration <- test.data$duration[j]
  time <- test.data$time[j]
  res <- test.data$res[j]

  for (i in 1:duration) {
    m[1,i] <- ifelse(i == 1, savings, m[6, i-1])

    m[2,i] <- -m[1,i] * c_a[i]

    m[3,i] <- -(m[1,i] + m[2,i]) * c_b[i]

    m[4,i] <- ifelse(i == duration, -(m[1,i] + m[2,i] + m[3,i]), -(m[1,i] + m[2,i]) / (time + 1 - i))

    if (i == res & res < time) {
      m[5, i] <- -(m[1,i] + m[2,i]) * (1 - rnew)
    } else {
      m[5, i] <- 0
    }

    m[6, i] <- m[1,i] + m[2,i] + m[3,i] + m[4,i] + m[5,i]
  }

  result[, 1:duration] <- result[, 1:duration] + m[, 1:duration]
}

Your inner loop has complex dependencies on the results from previous iterations, so I don't know if it can leverage vectorization of operations.

*Technically, R copies the matrix every time it's modified due to copy-on-modify semantics. I believe that R does some special things with its memory management so that each copy doesn't necessarily equate to a new allocation of memory, but the copy still represents overhead. Since you're doing element-wise operations, that may very well be your bottleneck, and migrating to C or C++ with Rcpp might be your best bet.
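
For reference, a rough and untested sketch of what that Rcpp route could look like (fill_person is only an illustrative name; the C++ indices are 0-based, hence the shifts):

library(Rcpp)

cppFunction('
NumericMatrix fill_person(double savings, int duration, double time, int res,
                          NumericVector c_a, NumericVector c_b, double rnew,
                          int ncol_out) {
  NumericMatrix m(6, ncol_out);   // zero-filled, padded to the width of result
  for (int i = 0; i < duration; ++i) {
    m(0, i) = (i == 0) ? savings : m(5, i - 1);
    m(1, i) = -m(0, i) * c_a[i];
    m(2, i) = -(m(0, i) + m(1, i)) * c_b[i];
    m(3, i) = (i == duration - 1)
                ? -(m(0, i) + m(1, i) + m(2, i))
                : -(m(0, i) + m(1, i)) / (time - i);   // time + 1 - (i + 1)
    m(4, i) = (i == res - 1 && res < time)
                ? -(m(0, i) + m(1, i)) * (1 - rnew)
                : 0.0;
    m(5, i) = m(0, i) + m(1, i) + m(2, i) + m(3, i) + m(4, i);
  }
  return m;
}')

result <- matrix(0, nrow = 6, ncol = 120)
for (j in 1:nrow(test.data)) {
  result <- result + fill_person(test.data$savings[j], test.data$duration[j],
                                 test.data$time[j], test.data$res[j],
                                 c_a, c_b, rnew, ncol(result))
}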

Upvotes: 0
