Reputation: 3081
I have the following code that I would like to optimise, but I am currently not sure how. First, let me give an introduction to the problem: test.data contains approximately 200,000 rows, which makes the implementation below extremely slow in R. The first thing I tried was to optimise the functions and remove as much testing (if statements) as possible; however, I am unable to do so in two instances in the code below.
library(data.table)

test.data <- data.table(person = c("A", "B", "C"),
                        duration = c(120, 50, 30),
                        time = c(159, 231, 312),
                        savings = c(140000, 200000, 300000),
                        ren = c(0.0037, 0.0011, 0.0015),
                        res = c(55, 10, 30))
set.seed(35)
# Deduction series; note that in this example they are arbitrary.
# They do not follow a pattern. I believe this is the core of the problem,
# which makes it extremely difficult to vectorise, since there is
# no closed-form solution.
c_a <- round(runif(max(test.data$duration)), 2) / 10
# Put in as a constant, but it can vary arbitrarily.
c_b <- rep(round((8.5 / 12) / 100, digits = 4), max(test.data$duration))
rnew <- 0.25
result <- matrix(0, nrow = 6, ncol = 120)

for (j in 1:nrow(test.data)) {
  savings  <- test.data$savings[j]
  duration <- test.data$duration[j]
  time     <- test.data$time[j]
  res      <- test.data$res[j]

  m <- matrix(nrow = 6, ncol = duration)
  for (i in 1:duration) {
    m[1, i] <- ifelse(i == 1, savings, m[6, i - 1])  # carry forward last balance
    m[2, i] <- -m[1, i] * c_a[i]
    m[3, i] <- -(m[1, i] + m[2, i]) * c_b[i]
    m[4, i] <- ifelse(i == duration, -(m[1, i] + m[2, i] + m[3, i]),
                      -(m[1, i] + m[2, i]) / (time + 1 - i))
    if (i == res & res < time) {
      m[5, i] <- -(m[1, i] + m[2, i]) * (1 - rnew)
    } else {
      m[5, i] <- 0
    }
    m[6, i] <- m[1, i] + m[2, i] + m[3, i] + m[4, i] + m[5, i]
  }
  # pad m to 120 columns, then accumulate element-wise into result
  m <- cbind(m, matrix(0, ncol = ncol(result) - ncol(m), nrow = nrow(result)))
  result <- matrix(mapply(sum, result, m, MoreArgs = list(na.rm = TRUE)),
                   ncol = ncol(result))
}
Second, I have tried to vectorise the code, but I believe this is not doable, since c_a and c_b are random values, so I can't simply raise things to a certain power. To vectorise the code I believe I would need a closed-form expression, but I am unable to derive one.
The third problem I came across is memory: if I store all intermediate results in memory, everything blows up to 3 * 120 * 6 values even in this small example, which is quite a significant memory increase in my opinion, so I am effectively forced to do this one person at a time.
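For scale, a rough back-of-the-envelope estimate (assuming every one of the ~200,000 persons kept a full 6 x 120 matrix of 8-byte doubles):

# ~200,000 persons, a 6 x 120 double matrix each
200000 * 6 * 120 * 8 / 2^30  # ~1.07 GiB of intermediate results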
In addition, I have tried %dopar%, but unfortunately memory constraints do not allow me to use more than 2 cores (16 GB of memory). Now I am wondering what further optimisation techniques exist without going as deep as Rcpp.
Upvotes: 0
Views: 63
Reputation: 25225
A possible approach to calculate the sum of the outstanding amounts (i.e. OP's row 1 of result). All the intermediate values (m[2,i], m[3,i], m[4,i], m[5,i]) can be calculated easily afterwards if required; see the sketch after the output below. Caveat: I did not time it with the actual dimensions.
library(data.table)

calcAmor <- function(ca, cb, rnew, dur, S0, tau, res) {
  # one amortisation step: S is the outstanding amount, ca.t = c(ca[i], i)
  amortize <- function(S, ca.t) {
    S - ca.t[1L] * S - (1 - ca.t[1L]) * cb * S -
      (S - ca.t[1L] * S) / (tau + 1 - ca.t[2L])
  }
  ans <- Reduce(amortize,
                split(cbind(ca, seq_along(ca)), seq_along(ca)),
                init = S0,
                accumulate = TRUE)[-(dur + 1L)]
  # redo the periods after the reset at i == res with the reduced balance
  ix <- min(res + 1L, dur):dur
  tmp <- Reduce(amortize,
                split(cbind(ca[ix], ix), seq_along(ix)),
                init = amortize(ans[res], c(ca[res], res)) -
                  (ans[res] - ans[res] * ca[res]) * (1 - rnew),
                accumulate = TRUE)
  ans[ix] <- tmp[-length(tmp)]
  ans
}
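One non-obvious piece is the split(cbind(ca, seq_along(ca)), seq_along(ca)) call: Reduce() passes only one element of x per step, so each deduction rate is bundled with its time index into a length-2 vector. A quick illustration with made-up values:

split(cbind(c(0.01, 0.02, 0.03), 1:3), 1:3)
$`1`
[1] 0.01 1.00

$`2`
[1] 0.02 2.00

$`3`
[1] 0.03 3.00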
set.seed(35)
test.data <- data.table(person = c("A", "B", "C"),
                        duration = c(120, 50, 30),
                        time = c(159, 231, 312),
                        savings = c(140000, 200000, 300000),
                        res = c(55, 10, 30))

maxd <- test.data[, max(duration)]
c_a <- round(runif(maxd), 2) / 10
rnew <- 0.25
cb <- round((8.5 / 12) / 100, digits = 4)

# per-person series of outstanding amounts, then sum across persons by period
test.data[, .(
  dur = seq_len(duration),
  S   = calcAmor(ca = c_a[seq_len(duration)], cb, rnew, dur = duration,
                 S0 = savings, tau = time, res = res)),
  by = .(person)][, sum(S), by = .(dur)]
output:
dur V1
1: 1 6.400000e+05
2: 2 5.783318e+05
3: 3 5.711966e+05
4: 4 5.336450e+05
5: 5 4.774502e+05
---
116: 116 7.075169e+00
117: 117 6.788631e+00
118: 118 6.339002e+00
119: 119 5.639335e+00
120: 120 5.297898e+00
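And here is a sketch of one way to rebuild the full 6 x duration matrix for a single person from its S series, reusing the formulas from the question (the helper name intermediates is mine):

intermediates <- function(S, ca, cb, rnew, dur, tau, res) {
  i  <- seq_along(S)
  m2 <- -S * ca[i]                  # deduction c_a
  m3 <- -(S + m2) * cb              # deduction c_b
  m4 <- ifelse(i == dur, -(S + m2 + m3),
               -(S + m2) / (tau + 1 - i))
  m5 <- ifelse(i == res & res < tau, -(S + m2) * (1 - rnew), 0)
  rbind(S, m2, m3, m4, m5, S + m2 + m3 + m4 + m5)
}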
Upvotes: 1
Reputation: 5059
The only thing I can suggest is to initialize m only once, with the same dimensions as result, and to replace the last 2 lines of the outer loop as shown below. This avoids re-allocating* m and the element-wise sum done with mapply.
result <- matrix(0, nrow = 6, ncol = 120)
m <- result

for (j in 1:nrow(test.data)) {
  savings  <- test.data$savings[j]
  duration <- test.data$duration[j]
  time     <- test.data$time[j]
  res      <- test.data$res[j]

  for (i in 1:duration) {
    m[1, i] <- ifelse(i == 1, savings, m[6, i - 1])
    m[2, i] <- -m[1, i] * c_a[i]
    m[3, i] <- -(m[1, i] + m[2, i]) * c_b[i]
    m[4, i] <- ifelse(i == duration, -(m[1, i] + m[2, i] + m[3, i]),
                      -(m[1, i] + m[2, i]) / (time + 1 - i))
    if (i == res & res < time) {
      m[5, i] <- -(m[1, i] + m[2, i]) * (1 - rnew)
    } else {
      m[5, i] <- 0
    }
    m[6, i] <- m[1, i] + m[2, i] + m[3, i] + m[4, i] + m[5, i]
  }
  # only the first `duration` columns were written this iteration, so only
  # those are accumulated; stale columns beyond that are simply ignored
  result[, 1:duration] <- result[, 1:duration] + m[, 1:duration]
}
Your inner loop has complex dependencies on the results of previous iterations, so I don't know whether it can leverage vectorised operations.
*Technically, R copies the matrix when it is modified while the object is shared, due to its copy-on-modify semantics. I believe R does some special things with its memory management so that each copy doesn't necessarily equate to a new allocation, but the copy still represents overhead.
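If you want to see where copies actually happen, base R's tracemem() reports each duplication; here is a minimal standalone illustration (separate from the code above):

m <- matrix(0, nrow = 6, ncol = 120)
tracemem(m)                           # start reporting copies of m
f <- function(x) { x[1, 1] <- 1; x }  # modifying the argument forces a copy
invisible(f(m))                       # prints a tracemem line: x was shared with m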
Since you're doing element-wise operations, that may very well be your bottleneck, and migrating to C or C++ with Rcpp might be your best bet.
Upvotes: 0