Reputation: 35
Hi all,
I am trying to use existing columns within a data.table to calculate new columns; however, the new column relies on the previous row's value. For example, say my column Rt = At + Bt + Rt-1. I have two columns that make up my key, scenario and t. Here is how I have been trying to do it:
for(i in 1:maxScenario){
  for(j in 2:nrow(dt)) {
    dt[scenario == i & t == j, "R"] <- dt[scenario == i & t == j - 1, "R"] +
      dt[scenario == i & t == j, "A"] + dt[scenario == i & t == j, "B"]
  } # end for loop for t
} # end for loop for scenario
The distinction here is that after the "<-" I'm using j - 1 instead of j for R to retrieve the previous row's value.
I realize this is adding a lot of computation time, and is a pretty rough way to go about this. Is there a better way within the data.table package to do this? I have tried using shift() but ran into problems there. Using shift() doesn't "recalculate" the columns based on A and B.
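For example, a one-shot update along these lines (not my exact code, but the same idea) doesn't work, because the right-hand side is evaluated once against whatever is currently in R, so the newly computed values never feed into the later rows:
# shift(R) reads the existing R column, so the recursion never carries forward
dt[, R := A + B + shift(R, 1L, fill = 0), by = scenario]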
I have considered using a recursive formula, but I wasn't sure what that would do to efficiency and run time. Ideally, I'm hoping to run about 100K scenarios and need these calculations tacked on after the stochastic scenarios are completed.
Thanks!
Here's an attempt at a small example. Each row's value of R depends on the value from the previous row.
t R A B
1 0 1 2
2 3 2 3
3 8 2 5
4 15 8 5
5 28 10 8
I was finally able to translate my actual problem function into algebra:
Rt = λ * Pt + λ * Rt-1 - min{λ * Pt + λ * Rt-1, Dt} - A(t) * max{λ * Pt + λ * Rt-1 - Dt - Mt, 0}, where Pt, Dt, and Mt are other known columns and A(t) is an indicator function that returns 0 when t %% 4 != 0, and 1 otherwise.
Is there a way to use shift() and cumsum() with such a nested equation?
Upvotes: 1
Views: 360
Reputation: 25225
Here is an option using Rcpp with data.table, as it's easier to think/code in C++ for a recursive equation:
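# indicator A(t): unary + converts the logical (t %% 4 == 0) to 0/1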
DT[, A := +(t %% 4 == 0)]
library(Rcpp)
cppFunction('NumericVector recur(double lambda, NumericVector P,
NumericVector D, NumericVector M, NumericVector A) {
int sz = P.size(), t;
NumericVector R(sz);
for (t=1; t<sz; t++) {
R[t] = lambda * P[t] + lambda * R[t-1] -
std::min(lambda * P[t] + lambda * R[t-1], D[t]) -
A[t] * std::max(lambda * P[t] + lambda * R[t-1] - D[t] - M[t], 0.0);
}
return(R);
}')
DT[, R := recur(lambda, P, D, M, A)]
output:
t P D M A R
1: 1 1.262954285 0.25222345 -0.4333103 0 0.00000000
2: 2 -0.326233361 -0.89192113 -0.6494716 0 0.72880445
3: 3 1.329799263 0.43568330 0.7267507 0 0.59361856
4: 4 1.272429321 -1.23753842 1.1519118 1 1.89610128
5: 5 0.414641434 -0.22426789 0.9921604 0 1.37963924
6: 6 -1.539950042 0.37739565 -0.4295131 0 0.00000000
7: 7 -0.928567035 0.13333636 1.2383041 0 0.00000000
8: 8 -0.294720447 0.80418951 -0.2793463 1 0.00000000
9: 9 -0.005767173 -0.05710677 1.7579031 0 0.05422319
10: 10 2.404653389 0.50360797 0.5607461 0 0.72583032
11: 11 0.763593461 1.08576936 -0.4527840 0 0.00000000
12: 12 -0.799009249 -0.69095384 -0.8320433 1 -1.23154792
13: 13 -1.147657009 -1.28459935 -1.1665705 0 0.09499689
14: 14 -0.289461574 0.04672617 -1.0655906 0 0.00000000
15: 15 -0.299215118 -0.23570656 -1.5637821 0 0.08609900
16: 16 -0.411510833 -0.54288826 1.1565370 1 0.38018234
data:
library(data.table)
set.seed(0L)
nr <- 16L
DT <- data.table(t=1L:nr, P=rnorm(nr), D=rnorm(nr), M=rnorm(nr))
lambda <- 0.5
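If, as in the question, the table also has a scenario column, the same function can be applied per group; a minimal sketch (the toy DT above has no scenario column, so this is just to illustrate):
DT[, R := recur(lambda, P, D, M, A), by = scenario]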
Upvotes: 1
Reputation: 8572
To my knowledge there is no way to iteratively calculate the rows with built-in functions from data.table. I believe there is even a duplicate question out there asking something similar (although I cannot find it right now).
We can, however, speed up the calculation by exploiting the structure of the formula. First, to obtain the result in the example provided, note that it is simply cumsum(shift(A, 1, fill = 0) + shift(B, 1, fill = 0)):
dt <- fread('t R A B
1 0 1 2
2 3 2 3
3 8 2 5
4 15 8 5
5 28 10 8')
dt[, R2 := cumsum(shift(A, 1, fill = 0) + shift(B, 1, fill = 0))]
dt
t R A B R2
1: 1 0 1 2 0
2: 2 3 2 3 3
3: 3 8 2 5 8
4: 4 15 8 5 15
5: 5 28 10 8 28
However, for the exact problem described, Rt = At + Bt + Rt-1, we have to be a bit smarter:
dt[, R3 := cumsum(A + B) - head(A + B, 1)]
dt
t R A B R2 R3
1: 1 0 1 2 0 0
2: 2 3 2 3 3 5
3: 3 8 2 5 8 12
4: 4 15 8 5 15 25
5: 5 28 10 8 28 43
This matches the description above. Note that I subtract the first row's A + B to match the example, where R1 = 0; under the assumption that R0 = 0 instead, it simply becomes cumsum(A + B).
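Since the question's data is keyed by scenario and t, both one-liners can also be run per group, assuming rows are ordered by t within each scenario (a sketch, using the scenario column from the question rather than this toy dt):
dt[, R3 := cumsum(A + B) - head(A + B, 1), by = scenario]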
As the question asks about possibly more complicated situations, I'll add a slower (but more general) approach. The idea here is to use the set function to avoid intermediate shallow copies (see help(set) or help("datatable-optimize")).
dt[, R4 := 0]
for(i in seq.int(2, dt[, .N])){
  # dummy complicated scenario
  f <- dt[seq(i), lm(A ~ B - 1)]
  set(dt, i, 'R4', unname(unlist(coef(f))))
}
dt
t R A B R2 R3 R4
1: 1 0 1 2 0 0 0.0000000
2: 2 3 2 3 3 5 0.6153846
3: 3 8 2 5 8 12 0.4736842
4: 4 15 8 5 15 25 0.9206349
5: 5 28 10 8 28 43 1.0866142
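Applied to the original recursion Rt = At + Bt + Rt-1 itself, the same set() pattern would look roughly like this (a sketch, again taking R1 = 0; R5 is just a new illustrative column name):
dt[, R5 := 0]
for(i in seq.int(2, dt[, .N])){
  # previous row's (already updated) R5 plus this row's A and B
  set(dt, i, 'R5', dt[['R5']][i - 1L] + dt[['A']][i] + dt[['B']][i])
}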
Upvotes: 1
Reputation: 27732
This creates a new column R2 with the same values as R:
DT[, R2 := shift( cumsum(A+B), type = "lag", fill = 0 ) ][]
# t R A B R2
# 1: 1 0 1 2 0
# 2: 2 3 2 3 3
# 3: 3 8 2 5 8
# 4: 4 15 8 5 15
# 5: 5 28 10 8 28
Upvotes: 1