David Foster

Reputation: 11

Calculations on each row of a data.table in R

I am doing environmental modelling for school using R and have come so far thanks to DataCamp and all of the tremendously helpful threads on SO, but am at a skills threshold, and a system resource impasse. Modelling forest stands, I have already written code to model growth, and now need to calculate other resource movement between those timesteps. For resource efficiency, I must separate the growth from resource movement.

I have thousands of stands growing over hundreds of years, and those stands will split as they are disturbed, so it is a fairly large dataset, but it is simplified as:

stands <- data.table(
  A = c(rep(1, 3), rep(2, 3), rep(3, 3)),
  B = rep(1:3, 3),
  C = round(runif(9), 2),
  D = c(1.5, NA, NA, 2.5, NA, NA, 1.2, NA, NA),
  E = c(9.2, NA, NA, 8.7, NA, NA, 7.8, NA, NA)
)

   A B    C   D   E
1: 1 1 0.57 1.5 9.2
2: 1 2 0.82  NA  NA
3: 1 3 0.07  NA  NA
4: 2 1 0.13 2.5 8.7
5: 2 2 0.29  NA  NA
6: 2 3 0.04  NA  NA
7: 3 1 0.93 1.2 7.8
8: 3 2 0.01  NA  NA
9: 3 3 0.49  NA  NA

in which A is the stand ID (str), B is the timestep (num), C is a value looked up based on prior calculations, and D & E are computed variables, except in the first timestep for each stand, as illustrated above. My goal is to fill in all of the NAs.

The formulae are different for D and E, and refer to other columns in the same row, and in the previous timestep for the same stand. e.g., calculating stands[2,"D"] will require references to both stands[2,"C"] and stands[1,"D"].

I inherited some code that uses for loops and mostly base R code to compute based on same-row and previous-row references. e.g.:

for(i in 2:nrow(stands)) {
  stands$D[i] <- 0.1234 * stands$D[i-1]
}

This is completely functional, but highly inefficient and a single model run with 1.3×10^6 stands was estimated to take a month. My final data set once I get into full model runs will be closer to 2.5–3.0×10^6. Obviously, some work was required.

I experimented with data.table and rewrote the code to use replacement by reference within for loops. I handled the lagged values needed for calculation as follows:

for(i in 1:nrow(stands)) {
  stands[(i-1):i, lagD := shift(D)][i, D := 0.1234*lagD][, lagD := NULL]
}

(with additional steps to reinitialize the loop to start correctly with each stand)

This worked! It brought model run time down to an estimated 10 hours, or about 0.02 seconds per row of values (according to Sys.time testing I ran). But I want to see if I can push it further, because that will be closer to 20–25 hours in a full model run if my processing time scales linearly (which it should). This might be acceptable, but I would like to be able to see a run within a business day.

I believe the greatest time suck in the above data.table code is the addition of the lagged [i-1] values to the row currently being calculated [i], and their later removal.

I have pored over the SO forum and other sources looking for hints on how this might work, experimented with melting, tried to wrap my head around the apply family for this purpose, and more, but I cannot find a next step to further improve efficiency here.

Help!

Edit just to add that the total number of computed columns is 16 and many calculations depend on values in other columns (both in the same row, and a lagged row), meaning I can't just calculate each column all at once.

Edit2: Sorry for the lack of specificity, I was previously criticized for not making an example generic enough, so I guess I went too far on this one. Makes sense that it would be easier to help with more of the specifics!

An excerpt from the original code block calculating values follows, in which stands[A:D] are values attached to the table before starting calculations, stands[a:e] are the values to be calculated, and lookup[a:d] is a separate data.table with supplied constants.

for(i in 1:nrow(stands)) {
  stands$a[i]  <- (stands$A[i]*123 + stands$b[i-1]) * (1-lookup$a)
  stands$a2[i] <- stands$a[i] * 123
  stands$b[i]  <- stands$a[i] + stands$a2[i]
  stands$b[i]  <- (stands$B[i]*123 + stands$b[i-1]) * (1-lookup$b)
  stands$b2[i] <- stands$b[i] * 123
  stands$c[i]  <- stands$c[i] + stands$c2[i]
  stands$d[i]  <- (stands$C[i]*123 + stands$D[i]*123 + stands$c[i-1]) * (1-lookup$d)
  stands$e[i]  <- (stands$D[i]*123 + stands$e[i-1]) * (1-lookup$e)
}

Upvotes: 1

Views: 912

Answers (1)

Andy Baxter

Reputation: 7626

I think your only hope for improved efficiency here is to try to get rid of the for loops entirely and take advantage of R's vectorisation. The reason for loops like this are notoriously slow is not the loop itself so much as what happens inside it: each `stands$D[i] <- ...` assignment is interpreted separately and can trigger a copy of the column, so the super-fast vectorised internals never get a long run of work. It works much faster if you can pass whole columns (or lagged columns) and calculate all rows at once.

The problem you have in your dataset of course is that several columns depend on the previous row being calculated first. There's no obvious way of vectorising this: each row has to wait for the previous row's calculation to complete before it knows its starting values, and that serial dependency is what's causing the slowdown.

Your best approach then would potentially take a few steps:

  • using equations to calculate what a later value might be, not depending on a prior value being calculated
  • calculating all rows of each column in a single step where possible
  • adding columns in several steps
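As a sketch of the first bullet (hedged, with made-up constants k and y standing in for the real ones): a first-order linear recurrence of the form x[i] = k * x[i-1] + y[i] can be unrolled without a row-by-row loop using base R's Reduce() with accumulate = TRUE:

```r
# First-order recurrence x[i] = k * x[i-1] + y[i], evaluated without an
# explicit row loop. k and y are made-up stand-ins for the real constants.
k <- 0.1234
y <- c(5, 1, 2, 3)  # y[1] doubles as the starting value for the stand

x <- Reduce(function(prev, cur) k * prev + cur, y, accumulate = TRUE)
# x[1] = 5; x[2] = 0.1234*5 + 1 = 1.617; and so on
```

Applied per stand (e.g. inside a `by = A` call), this turns one whole class of previous-row dependencies into a single vector operation.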

To use your simple example of D columns above, but trying it on a data.table of 100,000 rows:

library(data.table)

# Constructing a large table
stands <- data.table(
  A = rep(1:1000, each = 100),
  B = rep(1:100, times = 1000),
  C = round(runif(100000), 2)
)

# Adding a D column with values only for B = 1
D_vals <- data.table(A = 1:1000, B = 1, D = runif(1000) + 1)
stands <- merge(stands, D_vals, all.x = TRUE)

# A *very* slow for loop
for(i in 2:nrow(stands)) {
  stands$D[i] <- 0.1234*stands$D[i-1]
}

This loop takes a very long time to run.

But in this example, each D-update, starting from the first row of each ID with the only non-missing D measurement, effectively multiplies the starting D by 0.1234 a total of (B - 1) times. So a super-quick data.table way of doing this would be:

stands[,baseD := max(D, na.rm = TRUE), by = "A"
       ][,D := baseD * 0.1234 ^ (B - 1)
         ][,baseD := NULL]

... which runs in a fraction of a second. This might take care of some of your columns.
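As a quick sanity check (a small sketch, not from the timing run above), the closed form can be verified against the loop version on the question's 9-row example:

```r
library(data.table)

stands <- data.table(
  A = rep(1:3, each = 3),                       # stand ID
  B = rep(1:3, times = 3),                      # timestep
  D = c(1.5, NA, NA, 2.5, NA, NA, 1.2, NA, NA)  # only B == 1 is known
)

# Loop version: propagate within each stand, leaving B == 1 rows untouched
loopD <- copy(stands)
for (i in 2:nrow(loopD)) {
  if (loopD$B[i] > 1) loopD$D[i] <- 0.1234 * loopD$D[i - 1]
}

# Closed-form version, as in the chained expression above
vecD <- copy(stands)
vecD[, baseD := max(D, na.rm = TRUE), by = "A"
     ][, D := baseD * 0.1234 ^ (B - 1)
       ][, baseD := NULL]

all.equal(loopD$D, vecD$D)  # TRUE
```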

Other columns from your code above can be easily added with a similar speed without a for loop. To take some of the simpler calculations from above:

stands[,`:=`(
  a2 = a * 123,
  b = a + a2,
  c = c + c2,
  # etc.
  )]

You could potentially speed up your code a great deal by combining these steps, but what can be combined is ultimately tied to your own data/project and might be quite complex. Sorry that this doesn't solve all your problems quite yet, but hopefully it's a pointer in the direction you want to go.
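One more pattern worth noting (a hedged sketch, with a made-up lookup_d constant and toy data): a column that depends on the previous row of a different, already-computed column, like d depending on lagged c in your excerpt, can be filled for all rows in one vectorised step with shift() grouped by stand, so the first timestep of each stand gets NA rather than a value leaked in from the previous stand:

```r
library(data.table)

stands <- data.table(
  A = rep(1:2, each = 3),         # stand ID
  B = rep(1:3, times = 2),        # timestep
  C = c(0.5, 0.6, 0.7, 0.2, 0.3, 0.4),
  D = c(1.5, 1.4, 1.3, 2.5, 2.4, 2.3),
  c = c(10, 11, 12, 20, 21, 22)   # assumed already computed in full
)
lookup_d <- 0.4  # made-up constant

# d[i] = (C[i]*123 + D[i]*123 + c[i-1]) * (1 - lookup_d), all rows at once;
# grouping by A keeps shift() from crossing stand boundaries
stands[, d := (C * 123 + D * 123 + shift(c)) * (1 - lookup_d), by = A]
```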

Edit - or use Rcpp

Having said all of that, I think you can probably reach a pretty high level of speed by rewriting your code in an Rcpp function. It might take some fiddling to pass your data.table into and out of the function call, but in effect looping through it in Rcpp rather than R achieves all the parts you want without the slowness of an R loop.

As an attempt to do some of what I think your loop is doing in C++, here's a rewrite of your loop (with made-up lookup table within the function), which runs instantly over 100,000 rows:

#include <Rcpp.h>
using namespace Rcpp;


// [[Rcpp::export]]
DataFrame updateDf(const DataFrame& df) {

  NumericVector A = df["A"];
  NumericVector B = df["B"];
  NumericVector C = df["C"];
  NumericVector D = df["D"];
  NumericVector E = df["E"];
  NumericVector a = df["a"];
  NumericVector a2 = df["a2"];
  NumericVector b = df["b"];
  NumericVector b2 = df["b2"];
  NumericVector c = df["c"];
  NumericVector c2 = df["c2"];
  NumericVector d = df["d"];
  NumericVector e = df["e"];
  
  int n = A.size();
  
    double lookup_a = 0.1;
    double lookup_b = 0.2;
    double lookup_d = 0.4;
    double lookup_e = 0.5;
    
  for(int i = 1; i < n; i++) {
    a[i] = (A[i]*123 + b[i-1]) * (1-lookup_a);
    a2[i] = a[i] * 123;
    b[i] = a[i] + a2[i];
    b[i] = (B[i]*123 + b[i-1]) * (1-lookup_b);
    b2[i] = b[i] * 123;
    c[i] = c[i] + c2[i];
    d[i] = (C[i]*123 + D[i]*123 + c[i-1]) * (1-lookup_d);
    e[i] = (D[i]*123 + e[i-1]) * (1-lookup_e);
  };
  
  DataFrame out = DataFrame::create(
  _["A"] = A,
  _["B"] = B,
  _["C"] = C,
  _["D"] = D,
  _["E"] = E,
  _["a"] = a,
  _["a2"] = a2,
  _["b"] = b,
  _["b2"] = b2,
  _["c"] = c,
  _["c2"] = c2,
  _["d"] = d,
  _["e"] = e
  );
  
  return out;
}

Save this as a separate file (such as "updateDf.cpp") and source with Rcpp::sourceCpp("updateDf.cpp") to load the function. This should then be callable on your stands dataframe to run the loop. You'll need to play around with it to make it fit your expectations. I found this page pretty helpful!

Upvotes: 2
