Calculations on each row of a data.table in r

Question

I am doing environmental modelling for school using R and have come so far thanks to DataCamp and all of the tremendously helpful threads on SO, but am at a skills threshold, and a system resource impasse. Modelling forest stands, I have already written code to model growth, and now need to calculate other resource movement between those timesteps. For resource efficiency, I must separate the growth from resource movement.

I have thousands of stands growing over hundreds of years, and those stands will split as they are disturbed, so it is a fairly large dataset, but it is simplified as:

stands <- data.table(A=c(rep(1,3),rep(2,3),rep(3,3)),B=rep(1:3,3),C=round(runif(9),2),D=c(1.5,NA,NA,2.5,NA,NA,1.2,NA,NA),E=c(9.2,NA,NA,8.7,NA,NA,7.8,NA,NA))

   A B    C   D   E
1: 1 1 0.57 1.5 9.2
2: 1 2 0.82  NA  NA
3: 1 3 0.07  NA  NA
4: 2 1 0.13 2.5 8.7
5: 2 2 0.29  NA  NA
6: 2 3 0.04  NA  NA
7: 3 1 0.93 1.2 7.8
8: 3 2 0.01  NA  NA
9: 3 3 0.49  NA  NA

in which A is the stand ID (str), B is the timestep (num), C is a value looked up based on prior calculations, and D & E are computed variables except in the first timestep for each stand, as illustrated above. My goal is to fill in all of the NAs

The formulae are different for D and E, and refer to other columns in the same row, and in the previous timestep for the same stand. e.g., calculating stands[2,"D"] will require references to both stands[2,"C"] and stands[1,"D"].

I inherited some code that uses for loops and mostly base r code to compute based on same-row and previous-row references. e.g.:

for(i in 1:nrow(stands)) {
stands$D[i] <- 0.1234*stands$D[i-1]
}

This is completely functional, but highly inefficient and a single model run with 1.3x10⁶ stands was estimated to take a month. My final data set once I get into full model runs will be closer to 2.5-3.0x10⁶. Obviously, some work was required.

I experimented with data.table and rewrote the code to use replacement by reference within for loops. I handled the lagged values needed for calculation as follows:

for(i in 1:nrow(stands)) {
stands[(i-1):i,lagD := shift(D)][i,D := 0.1234*lagD][,lagD := NULL]
}

(with additional steps to reinitialize the loop to start correctly with each stand)

This worked! It brought model run time down to an estimated 10 hours, or about 0.02 seconds per row of values (according to Sys.time testing I ran). But I want to see if I can push it further, because that will be closer to 20-25 hours when I go for a full model run if my processing-time scales linearly (which it should). This might be acceptable, but I would like to be able to see a run within a business day.

I believe the greatest time suck in the above data.table code is the addition of the lagged [i-1] values to the row currently being calculated [i], the later removal.

I have poured over the SO forum and other sources looking for hints on how this might work, experimented with melting, tried to wrap my head around the apply family for this purpose, and more, but I cannot find a next step to further improve efficiency here.

Help!

Edit just to add that the total number of computed columns is 16 and many calculations depend on values in other columns (both in the same row, and a lagged row), meaning I can't just calculate each column all at once.

Edit2: Sorry for the lack of specificity, I was previously criticized for not making an example generic enough, so I guess I went too far on this one. Makes sense that it would be easier to help with more of the specifics!

An excerpt from the original code block calculating values follows, in which stands[A:D] are values attached to the table before starting calculations, stands[a:e] are the values to be calculated, and lookup[a:d] is a separate dt with supplied constants.

for(i in 1:nrow(stands)) {
stands$a[i] <- (stands$A[i]*123 + stands$b[i-1]) * (1-lookup$a)
stands$a2[i] <- stands$a[i] * 123
stands$b[i] <- stands$a[i] + stands$a2[i]
stands$b[i] <- (stands$B[i]*123 + stands$b[i-1]) * (1-lookup$b)
stands$b2[i] <- stands$b[i] * 123
stands$c[i] <- stands$c[i] + stands$c2[i]
stands$d[i] <- (stands$C[i]*123 + stands$D[i]*123 + stands$c[i-1]) * (1-lookup$d)
stands$e[i] <- (stands$D[i]*123 + stands$e[i-1]) * (1-lookup$e)
}

Calculations on each row of a data.table in r

Answers (1)

Edit - or use `Rcpp`

Related Questions

Calculations on each row of a data.table in r

Answers (1)

Edit - or use Rcpp

Related Questions

Edit - or use `Rcpp`