Reputation: 35
Hi all,
I am trying to use existing columns within a data.table to calculate new columns; however, the new column relies on the previous row's value. For example, say my column Rt = At + Bt + Rt-1. I have two columns that make up my key, scenario and t. Here is how I have been trying to do it:
for(i in 1:maxScenario){
  for(j in 2:nrow(dt)) {
    dt[scenario == i & t == j, "R"] <- dt[scenario == i & t == j - 1, "R"] +
      dt[scenario == i & t == j, "A"] + dt[scenario == i & t == j, "B"]
  } # end for loop for t
} # end for loop for scenario
The distinction here is that after the "<-" I'm using j - 1 instead of j for R to retrieve the previous row's value.
I realize this is adding a lot of computation time, and is a pretty rough way to go about this. Is there a better way within the data.table package to do this? I have tried using shift() but ran into problems there. Using shift() doesn't "recalculate" the columns based on A and B.
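For example, a one-shot update along these lines (not my exact code, but the same idea) doesn't work, because the right-hand side is evaluated once against whatever is currently in R, so the newly computed values never feed into the later rows:
# shift(R) reads the existing R column, so the recursion never carries forward
dt[, R := A + B + shift(R, 1L, fill = 0), by = scenario]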
I have considered using a recursive formula, but I wasn't sure what that would do to efficiency and run time. Ideally, I'm hoping to run about 100K scenarios and need these calculations tacked on after the stochastic scenarios are completed.
Thanks!
Here's an attempt at a small example. Each row's value of R depends on the value from the previous row.
t R A B
1 0 1 2
2 3 2 3
3 8 2 5
4 15 8 5
5 28 10 8
I was finally able to translate my actual problem function into algebra:
Rt = λ * Pt + λ * Rt-1 - min{λ * Pt + λ * Rt-1, Dt} - A(t) * max{λ * Pt + λ * Rt-1 - Dt - Mt, 0}, where Pt, Dt, and Mt are other known columns and A(t) is an indicator function that returns 0 when t %% 4 != 0, and 1 otherwise.
Is there a way to use shift() and cumsum() with such a nested equation?
Upvotes: 1
Views: 360
Reputation: 25225
Here is an option using Rcpp with data.table, as it's easier to think/code in C++ for a recursive equation:
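# indicator A(t): unary + converts the logical (t %% 4 == 0) to 0/1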
DT[, A := +(t %% 4 == 0)]
library(Rcpp)
cppFunction('NumericVector recur(double lambda, NumericVector P,
NumericVector D, NumericVector M, NumericVector A) {
int sz = P.size(), t;
NumericVector R(sz);
for (t=1; t<sz; t++) {
R[t] = lambda * P[t] + lambda * R[t-1] -
std::min(lambda * P[t] + lambda * R[t-1], D[t]) -
A[t] * std::max(lambda * P[t] + lambda * R[t-1] - D[t] - M[t], 0.0);
}
return(R);
}')
DT[, R := recur(lambda, P, D, M, A)]
output:
t P D M A R
1: 1 1.262954285 0.25222345 -0.4333103 0 0.00000000
2: 2 -0.326233361 -0.89192113 -0.6494716 0 0.72880445
3: 3 1.329799263 0.43568330 0.7267507 0 0.59361856
4: 4 1.272429321 -1.23753842 1.1519118 1 1.89610128
5: 5 0.414641434 -0.22426789 0.9921604 0 1.37963924
6: 6 -1.539950042 0.37739565 -0.4295131 0 0.00000000
7: 7 -0.928567035 0.13333636 1.2383041 0 0.00000000
8: 8 -0.294720447 0.80418951 -0.2793463 1 0.00000000
9: 9 -0.005767173 -0.05710677 1.7579031 0 0.05422319
10: 10 2.404653389 0.50360797 0.5607461 0 0.72583032
11: 11 0.763593461 1.08576936 -0.4527840 0 0.00000000
12: 12 -0.799009249 -0.69095384 -0.8320433 1 -1.23154792
13: 13 -1.147657009 -1.28459935 -1.1665705 0 0.09499689
14: 14 -0.289461574 0.04672617 -1.0655906 0 0.00000000
15: 15 -0.299215118 -0.23570656 -1.5637821 0 0.08609900
16: 16 -0.411510833 -0.54288826 1.1565370 1 0.38018234
data:
library(data.table)
set.seed(0L)
nr <- 16L
DT <- data.table(t=1L:nr, P=rnorm(nr), D=rnorm(nr), M=rnorm(nr))
lambda <- 0.5
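If, as in the question, the table also has a scenario column, the same function can be applied per group; a minimal sketch (the toy DT above has no scenario column, so this is just to illustrate):
DT[, R := recur(lambda, P, D, M, A), by = scenario]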
Upvotes: 1
Reputation: 8572
To my knowledge there is no way to iteratively calculate the rows with built-in functions from data.table. I believe there is even a duplicate question out there asking something similar (although I cannot find it right now).
We can, however, speed up the calculation by exploiting the structure of the formula. First, to obtain the result in the example provided, note that it is simply cumsum(shift(A, 1, fill = 0) + shift(B, 1, fill = 0)):
dt <- fread('t R A B
1 0 1 2
2 3 2 3
3 8 2 5
4 15 8 5
5 28 10 8')
dt[, R2 := cumsum(shift(A, 1, fill = 0) + shift(B, 1, fill = 0))]
dt
t R A B R2
1: 1 0 1 2 0
2: 2 3 2 3 3
3: 3 8 2 5 8
4: 4 15 8 5 15
5: 5 28 10 8 28
However, for the exact problem described, Rt = At + Bt + Rt-1, we have to be a bit smarter:
dt[, R3 := cumsum(A + B) - head(A + B, 1)]
dt
t R A B R2 R3
1: 1 0 1 2 0 0
2: 2 3 2 3 3 5
3: 3 8 2 5 8 12
4: 4 15 8 5 15 25
5: 5 28 10 8 28 43
This matches the description above. Note that I subtract the first row's A + B to match the example, where R1 = 0; under the assumption that R0 = 0 instead, it simply becomes cumsum(A + B).
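Since the question's data is keyed by scenario and t, both one-liners can also be run per group, assuming rows are ordered by t within each scenario (a sketch, using the scenario column from the question rather than this toy dt):
dt[, R3 := cumsum(A + B) - head(A + B, 1), by = scenario]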
As the question asks about possibly more complicated situations, I'll add a slower (but more general) approach. The idea here is to use the set function to avoid intermediate shallow copies (see help(set) or help("datatable-optimize")).
dt[, R4 := 0]
for(i in seq.int(2, dt[, .N])){
  # dummy complicated scenario
  f <- dt[seq(i), lm(A ~ B - 1)]
  set(dt, i, 'R4', unname(unlist(coef(f))))
}
dt
t R A B R2 R3 R4
1: 1 0 1 2 0 0 0.0000000
2: 2 3 2 3 3 5 0.6153846
3: 3 8 2 5 8 12 0.4736842
4: 4 15 8 5 15 25 0.9206349
5: 5 28 10 8 28 43 1.0866142
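Applied to the original recursion Rt = At + Bt + Rt-1 itself, the same set() pattern would look roughly like this (a sketch, again taking R1 = 0; R5 is just a new illustrative column name):
dt[, R5 := 0]
for(i in seq.int(2, dt[, .N])){
  # previous row's (already updated) R5 plus this row's A and B
  set(dt, i, 'R5', dt[['R5']][i - 1L] + dt[['A']][i] + dt[['B']][i])
}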
Upvotes: 1
Reputation: 27732
This creates a new column R2 with the same values as R:
DT[, R2 := shift( cumsum(A+B), type = "lag", fill = 0 ) ][]
# t R A B R2
# 1: 1 0 1 2 0
# 2: 2 3 2 3 3
# 3: 3 8 2 5 8
# 4: 4 15 8 5 15
# 5: 5 28 10 8 28
Upvotes: 1