Saurabh Dhapodkar
Saurabh Dhapodkar

Reputation: 3

How to avoid for Loop in R while computing on subset data sets

I am having the below data set in R (it is having 3,700,000 rows):

Data Image

Bday_1  Bday_2  Bday_3  ….  Bday_31 HD_1  HD_2  HD_3    ……… HD_31
1.60    12.23   3.72        4.00    0    1      0       1
7.11    5.75    1.40        67.81   0    0      0       0
6.73    3.41    0.96        63.72   1    0      0       1
2.12    6.47    2.09        81.09   1    0      0       1
4.28    9.48    6.10        46.09   0    1      0       0

I wish to calculate another variable having 31 columns with the following rules:

Eg: Just for the first row

chk = ceiling(Bday_3) = 4
temp_vector = "HD_3" "HD_4" "HD_5" "HD_6" (it is 4 HDs after the starting number)
Cday_3 = Bday_3 + sum(temp_vector)

I have written for loop for it as below. It is working; but for data size of this magnitude, it is taking like more than 24 hours to execute. Is there a simpler way ?

for (k in 1: nrow(FBF_SLA)){
  for (i in 1:31){
    days_check = ceiling(FBF_SLA[k,c(BDays_SLA[i])])
    temp_vector = paste(rep("HD",days_check),c(i : (i + days_check - 1)),sep = "_")
    FBF_SLA[k,c(CDays_SLA[i])] = FBF_SLA[k,c(BDays_SLA[i])] + unname(rowSums(FBF_SLA[k,c(temp_vector)]))
  }
}

Upvotes: 0

Views: 132

Answers (1)

bgoldst
bgoldst

Reputation: 35314

This was a tricky problem! As you discovered, the main challenge here is performance. R is an excellent language, but probably its greatest weakness is that it makes it very easy to write slow code, and sometimes very difficult to write fast code.

The cardinal rule of performance optimization in R is to vectorize your operations. Your double-loop performs one calculation per output cell, which is very costly, and can be improved by rewriting the calculation to work on the entirety of the 3,700,000-row input columns as vector operands.

There was a relatively unusual challenge with your specific problem, however. The set of HD_* input columns that must be summed for the RHS of your Cday addition operation varies by calculation. In other words, the exact HD_* columns that you need to index out of the input data.frame for the sum are different for each output cell. It would be very costly to perform a separate index operation and sum for every output cell calculation.

I realized that since each sum always works on a left-to-right extent of HD_* columns, we can precompute the cumsum() of all HD_* columns in each row into a temporary variable. Then, to get the sum of any extent, we can simply index out the final cumsum value of the extent and then subtract off the cumsum value from one column prior to the start of the extent. This must be done separately for each base column index, so we need to use lapply() over column indexes 1 to NC. To support the subtraction operation for the first column, we also need to prepend a column of zeroes. In my code, I store this precomputed data in hdcums.

These index operations can be vectorized using the index matrix form of indexing. Thus, all we need to do is build an index matrix with one row per hdcums row, with the row index as the linear sequence from 1 to NR, and the column index as the base column index plus the number of columns in the extent, which is your ceiling(Bday_*) formula. We also need to wrap it in pmin(NC+1L,...) in case the extent extends outside the HD_* column range (note: the +1L is because of the prepended zero column).

Thus, the vectorized calculation consists of the entire Bday_* column, plus the vector of final cumsum values produced by indexing hdcums with the index matrix, subtract the prior cumsum values produced by indexing hdcums with the base column index.


To work on this problem I generated some random test data using rlnorm() for the Bday_* values and sample() for the HD_* values. My decision to use rlnorm() was based on the fact that you have all positive values in Bday_*, and I wanted to get something like a normal distribution, which the log-normal distribution is perfect for. I also used an increasing sdlog along the Bday_* columns since your Bday_* values appear to get larger from left-to-right (at least in Bday_31), but I admit that wasn't entirely necessary.

Below I present a demo using 10 rows and 6 columns each of Bday_* and HD_*.

set.seed(2);
NC <- 6L;
NR <- 10L;
df <- as.data.frame(matrix(c(round(rlnorm(NR*NC,rep(0.5+seq_len(NC)*0.1,each=NR),0.3),2),sample(0:1,NR*NC,replace=T)),NR,dimnames=list(NULL,c(paste0('Bday_',seq_len(NC)),paste0('HD_',seq_len(NC))))));
df;
##    Bday_1 Bday_2 Bday_3 Bday_4 Bday_5 Bday_6 HD_1 HD_2 HD_3 HD_4 HD_5 HD_6
## 1    1.39   2.28   4.17   3.07   2.42   2.34    0    0    0    0    1    1
## 2    1.93   2.70   1.55   2.71   1.51   5.58    0    1    0    0    1    0
## 3    2.93   1.79   3.59   3.40   2.11   2.54    1    0    1    1    0    0
## 4    1.30   1.47   4.00   2.26   4.81   4.40    1    1    0    0    1    1
## 5    1.78   3.44   2.23   1.95   3.28   2.19    0    0    0    1    0    0
## 6    1.90   1.01   1.07   2.06   4.94   1.67    1    1    1    0    1    1
## 7    2.25   2.62   2.57   1.47   2.48   2.73    1    1    1    1    0    0
## 8    1.70   2.04   1.86   1.88   2.65   3.98    0    1    0    0    1    0
## 9    3.30   2.73   2.82   2.08   2.57   4.23    1    0    0    0    0    1
## 10   1.75   2.29   2.43   2.28   1.90   4.96    0    0    0    0    1    1
hdcums <- cbind(0,t(apply(df[grep('^HD_',names(df))],1,cumsum)));
hdcums;
##         HD_1 HD_2 HD_3 HD_4 HD_5 HD_6
##  [1,] 0    0    0    0    0    1    2
##  [2,] 0    0    1    1    1    2    2
##  [3,] 0    1    1    2    3    3    3
##  [4,] 0    1    2    2    2    3    4
##  [5,] 0    0    0    0    1    1    1
##  [6,] 0    1    2    3    3    4    5
##  [7,] 0    1    2    3    4    4    4
##  [8,] 0    0    1    1    1    2    2
##  [9,] 0    1    1    1    1    1    2
## [10,] 0    0    0    0    0    1    2
df[,paste0('Cday_',seq_len(NC))] <- lapply(seq_len(NC),function(col) df[[col]] + hdcums[matrix(c(seq_len(NR),pmin(NC+1L,col+ceiling(df[[col]]))),NR,2L)] - hdcums[,col]);
df;
##    Bday_1 Bday_2 Bday_3 Bday_4 Bday_5 Bday_6 HD_1 HD_2 HD_3 HD_4 HD_5 HD_6 Cday_1 Cday_2 Cday_3 Cday_4 Cday_5 Cday_6
## 1    1.39   2.28   4.17   3.07   2.42   2.34    0    0    0    0    1    1   1.39   2.28   6.17   5.07   4.42   3.34
## 2    1.93   2.70   1.55   2.71   1.51   5.58    0    1    0    0    1    0   2.93   3.70   1.55   3.71   2.51   5.58
## 3    2.93   1.79   3.59   3.40   2.11   2.54    1    0    1    1    0    0   4.93   2.79   5.59   4.40   2.11   2.54
## 4    1.30   1.47   4.00   2.26   4.81   4.40    1    1    0    0    1    1   3.30   2.47   6.00   4.26   6.81   5.40
## 5    1.78   3.44   2.23   1.95   3.28   2.19    0    0    0    1    0    0   1.78   4.44   3.23   2.95   3.28   2.19
## 6    1.90   1.01   1.07   2.06   4.94   1.67    1    1    1    0    1    1   3.90   3.01   2.07   4.06   6.94   2.67
## 7    2.25   2.62   2.57   1.47   2.48   2.73    1    1    1    1    0    0   5.25   5.62   4.57   2.47   2.48   2.73
## 8    1.70   2.04   1.86   1.88   2.65   3.98    0    1    0    0    1    0   2.70   3.04   1.86   2.88   3.65   3.98
## 9    3.30   2.73   2.82   2.08   2.57   4.23    1    0    0    0    0    1   4.30   2.73   2.82   3.08   3.57   5.23
## 10   1.75   2.29   2.43   2.28   1.90   4.96    0    0    0    0    1    1   1.75   2.29   3.43   4.28   3.90   5.96

And here's a full-size performance test with system.time() calls on each of the 3 time-consuming lines:

set.seed(2);
NC <- 31L;
NR <- 3.7e6L;
system.time({ df <- as.data.frame(matrix(c(round(rlnorm(NR*NC,rep(0.5+seq_len(NC)*0.1,each=NR),0.3),2),sample(0:1,NR*NC,replace=T)),NR,dimnames=list(NULL,c(paste0('Bday_',seq_len(NC)),paste0('HD_',seq_len(NC)))))); });
##    user  system elapsed
##  29.937   2.906  32.853
head(df); tail(df);
##   Bday_1 Bday_2 Bday_3 Bday_4 Bday_5 Bday_6 Bday_7 Bday_8 Bday_9 Bday_10 Bday_11 Bday_12 Bday_13 Bday_14 Bday_15 Bday_16 Bday_17 Bday_18 Bday_19 Bday_20 Bday_21 Bday_22 Bday_23 Bday_24 Bday_25 Bday_26 Bday_27 Bday_28 Bday_29 Bday_30 Bday_31 HD_1 HD_2 HD_3 HD_4 HD_5 HD_6 HD_7 HD_8 HD_9 HD_10 HD_11 HD_12 HD_13 HD_14 HD_15 HD_16 HD_17 HD_18 HD_19 HD_20 HD_21 HD_22 HD_23 HD_24 HD_25 HD_26 HD_27 HD_28 HD_29 HD_30 HD_31
## 1   1.39   1.64   1.45   2.29   2.62   3.26   2.34   3.91   2.65    6.41    5.73    3.92    5.11   10.65    5.51    8.10    7.41    8.07    9.56   11.83    9.05    8.81   16.54   23.12   26.23   31.42   26.89   19.91   31.12   38.50   47.66    0    1    1    1    0    1    0    0    0     1     1     1     1     1     0     1     1     1     1     1     0     1     1     0     1     1     1     1     1     1     0
## 2   1.93   3.26   2.45   1.92   4.06   2.01   2.27   6.61   6.55    5.42    5.76    5.71    3.65    6.55    6.52   10.95    7.20   11.13    7.16   17.41   16.56   16.35   20.21   14.15   19.80   12.96   12.05   15.69   40.81   29.15   48.28    0    1    1    1    0    1    1    1    1     1     0     0     0     0     0     1     1     1     1     0     0     0     1     0     1     0     1     1     1     1     1
## 3   2.93   1.77   2.38   2.22   3.29   3.93   3.16   2.36   5.92    2.51    3.98    5.41    7.65    8.40    6.91   11.61   10.77   17.68   11.76   10.51   10.07    9.65   14.99   17.15   18.05   16.85   21.61   46.14   33.15   43.76   40.10    1    0    1    0    1    1    0    1    0     1     0     0     1     0     1     0     0     1     1     0     0     1     1     1     1     1     0     0     0     0     1
## 4   1.30   2.33   2.28   2.26   2.14   4.53   2.43   8.69   2.47    2.22    4.95    6.17    5.59    4.61    8.52   10.94    8.59   15.78   10.07   15.32   10.36   23.15   10.73   20.32   18.99   24.20   19.67   23.73   17.06   28.68   22.02    0    1    0    0    1    0    1    0    1     1     1     1     0     1     0     1     0     1     0     1     1     0     0     1     0     0     1     0     0     0     0
## 5   1.78   2.01   2.62   3.41   3.99   4.59   4.40   2.04   5.61    2.61    4.62   11.14    3.82    3.98    6.19   10.00   11.32   12.49   11.20   10.57   30.62   10.36   18.64   26.65   13.70   25.38   47.55   48.51   22.14   65.28   31.64    0    0    1    0    0    1    1    0    0     0     0     1     0     1     0     1     0     1     1     0     1     1     0     0     0     1     0     1     1     1     1
## 6   1.90   2.10   1.32   2.55   3.17   3.66   5.37   4.61   4.05    3.15    5.14    6.32    7.09    5.21    8.47    8.49    3.59    6.44    7.73   11.83   12.47   18.67   10.56   18.36   13.59   36.21   13.16   26.77   30.47   39.61   45.86    0    0    0    1    1    1    0    1    0     1     0     1     1     1     1     1     0     0     0     0     1     1     1     1     1     0     1     1     0     0     1
##         Bday_1 Bday_2 Bday_3 Bday_4 Bday_5 Bday_6 Bday_7 Bday_8 Bday_9 Bday_10 Bday_11 Bday_12 Bday_13 Bday_14 Bday_15 Bday_16 Bday_17 Bday_18 Bday_19 Bday_20 Bday_21 Bday_22 Bday_23 Bday_24 Bday_25 Bday_26 Bday_27 Bday_28 Bday_29 Bday_30 Bday_31 HD_1 HD_2 HD_3 HD_4 HD_5 HD_6 HD_7 HD_8 HD_9 HD_10 HD_11 HD_12 HD_13 HD_14 HD_15 HD_16 HD_17 HD_18 HD_19 HD_20 HD_21 HD_22 HD_23 HD_24 HD_25 HD_26 HD_27 HD_28 HD_29 HD_30 HD_31
## 3699995   2.02   1.39   1.61   3.57   2.33   4.29   3.75   2.34   2.22    3.71    4.65    4.87    7.42    6.64    7.45    4.88    9.43    9.38    8.78    9.19   16.88   12.26   11.94   13.79   22.19   26.93   30.02   22.12   57.83   33.74   45.20    0    1    1    1    0    1    0    1    1     0     0     1     0     0     0     1     0     1     0     1     0     1     0     1     0     1     0     0     0     1     0
## 3699996   2.44   2.35   2.52   2.88   2.63   2.30   4.33   2.88   4.67    3.87    3.76    5.94    7.09    4.94    6.10    9.11   11.67   10.41   18.27   11.97   12.27   13.84   19.16   15.52   15.90   37.48   21.96   19.86   30.20   23.79   56.55    0    0    0    1    0    1    0    1    1     0     0     0     1     0     1     1     1     0     1     1     1     1     0     1     1     1     1     0     0     0     1
## 3699997   3.48   2.19   2.27   2.67   1.41   4.02   2.02   3.14   5.40    2.30    7.51    4.96    4.06    8.74    6.45   11.91    8.70    7.13   14.66    8.99   13.88   12.94   23.77   16.46   16.18   22.91   25.20   24.44   35.57   52.91   47.57    0    1    0    0    1    0    0    0    0     0     0     1     0     1     1     1     0     1     0     1     1     1     0     0     1     1     1     1     1     0     1
## 3699998   2.68   1.63   1.55   1.62   3.02   3.00   2.92   3.10   3.59    4.78    5.99    5.50    5.75   12.67    9.35    7.57    9.35    7.43    5.41    9.15   10.92   21.05   12.65   25.29   21.00   18.63   15.57   30.96   32.45   18.39   37.04    0    1    1    0    1    1    0    1    0     0     0     1     1     1     1     0     1     1     1     1     1     0     1     0     0     0     1     1     0     1     1
## 3699999   1.62   2.17   2.95   1.63   5.78   3.49   3.52   4.59   6.45    4.79    7.41    3.93    6.91   10.42    8.75    8.46    5.48    8.64   10.62   22.62   14.13   13.12   10.19   21.40   26.65   31.67   31.14   26.52   21.06   61.38   37.07    1    0    1    0    1    1    1    0    0     1     0     1     0     1     1     0     1     0     0     1     1     1     0     1     1     0     0     0     1     0     1
## 3700000   2.00   2.54   1.45   2.94   2.56   3.49   3.40   2.52   4.23    4.71   11.82    8.48    5.34    6.82   10.67   16.26    9.16   10.27   15.61   10.46   24.70   18.43   17.41   15.70   13.10   24.01   19.12   25.59   32.41   32.45   32.74    1    0    1    1    0    1    1    0    1     0     1     1     0     0     1     1     0     0     1     1     0     1     1     1     0     1     1     0     1     1     0
system.time({ hdcums <- cbind(0,t(apply(df[grep('^HD_',names(df))],1,cumsum))); });
##    user  system elapsed
##  67.640   2.063  69.720
head(hdcums); tail(hdcums);
##        HD_1 HD_2 HD_3 HD_4 HD_5 HD_6 HD_7 HD_8 HD_9 HD_10 HD_11 HD_12 HD_13 HD_14 HD_15 HD_16 HD_17 HD_18 HD_19 HD_20 HD_21 HD_22 HD_23 HD_24 HD_25 HD_26 HD_27 HD_28 HD_29 HD_30 HD_31
## [1,] 0    0    1    2    3    3    4    4    4    4     5     6     7     8     9     9    10    11    12    13    14    14    15    16    16    17    18    19    20    21    22    22
## [2,] 0    0    1    2    3    3    4    5    6    7     8     8     8     8     8     8     9    10    11    12    12    12    12    13    13    14    14    15    16    17    18    19
## [3,] 0    1    1    2    2    3    4    4    5    5     6     6     6     7     7     8     8     8     9    10    10    10    11    12    13    14    15    15    15    15    15    16
## [4,] 0    0    1    1    1    2    2    3    3    4     5     6     7     7     8     8     9     9    10    10    11    12    12    12    13    13    13    14    14    14    14    14
## [5,] 0    0    0    1    1    1    2    3    3    3     3     3     4     4     5     5     6     6     7     8     8     9    10    10    10    10    11    11    12    13    14    15
## [6,] 0    0    0    0    1    2    3    3    4    4     5     5     6     7     8     9    10    10    10    10    10    11    12    13    14    15    15    16    17    17    17    18
##              HD_1 HD_2 HD_3 HD_4 HD_5 HD_6 HD_7 HD_8 HD_9 HD_10 HD_11 HD_12 HD_13 HD_14 HD_15 HD_16 HD_17 HD_18 HD_19 HD_20 HD_21 HD_22 HD_23 HD_24 HD_25 HD_26 HD_27 HD_28 HD_29 HD_30 HD_31
## [3699995,] 0    0    1    2    3    3    4    4    5    6     6     6     7     7     7     7     8     8     9     9    10    10    11    11    12    12    13    13    13    13    14    14
## [3699996,] 0    0    0    0    1    1    2    2    3    4     4     4     4     5     5     6     7     8     8     9    10    11    12    12    13    14    15    16    16    16    16    17
## [3699997,] 0    0    1    1    1    2    2    2    2    2     2     2     3     3     4     5     6     6     7     7     8     9    10    10    10    11    12    13    14    15    15    16
## [3699998,] 0    0    1    2    2    3    4    4    5    5     5     5     6     7     8     9     9    10    11    12    13    14    14    15    15    15    15    16    17    17    18    19
## [3699999,] 0    1    1    2    2    3    4    5    5    5     6     6     7     7     8     9     9    10    10    10    11    12    13    13    14    15    15    15    15    16    16    17
## [3700000,] 0    1    1    2    3    3    4    5    5    6     6     7     8     8     8     9    10    10    10    11    12    12    13    14    15    15    16    17    17    18    19    19
system.time({ df[,paste0('Cday_',seq_len(NC))] <- lapply(seq_len(NC),function(col) df[[col]] + hdcums[matrix(c(seq_len(NR),pmin(NC+1L,col+ceiling(df[[col]]))),NR,2L)] - hdcums[,col]); });
##    user  system elapsed
##  11.750   1.266  13.031
head(df); tail(df);
##   Bday_1 Bday_2 Bday_3 Bday_4 Bday_5 Bday_6 Bday_7 Bday_8 Bday_9 Bday_10 Bday_11 Bday_12 Bday_13 Bday_14 Bday_15 Bday_16 Bday_17 Bday_18 Bday_19 Bday_20 Bday_21 Bday_22 Bday_23 Bday_24 Bday_25 Bday_26 Bday_27 Bday_28 Bday_29 Bday_30 Bday_31 HD_1 HD_2 HD_3 HD_4 HD_5 HD_6 HD_7 HD_8 HD_9 HD_10 HD_11 HD_12 HD_13 HD_14 HD_15 HD_16 HD_17 HD_18 HD_19 HD_20 HD_21 HD_22 HD_23 HD_24 HD_25 HD_26 HD_27 HD_28 HD_29 HD_30 HD_31 Cday_1 Cday_2 Cday_3 Cday_4 Cday_5 Cday_6 Cday_7 Cday_8 Cday_9 Cday_10 Cday_11 Cday_12 Cday_13 Cday_14 Cday_15 Cday_16 Cday_17 Cday_18 Cday_19 Cday_20 Cday_21 Cday_22 Cday_23 Cday_24 Cday_25 Cday_26 Cday_27 Cday_28 Cday_29 Cday_30 Cday_31
## 1   1.39   1.64   1.45   2.29   2.62   3.26   2.34   3.91   2.65    6.41    5.73    3.92    5.11   10.65    5.51    8.10    7.41    8.07    9.56   11.83    9.05    8.81   16.54   23.12   26.23   31.42   26.89   19.91   31.12   38.50   47.66    0    1    1    1    0    1    0    0    0     1     1     1     1     1     0     1     1     1     1     1     0     1     1     0     1     1     1     1     1     1     0   2.39   3.64   3.45   4.29   3.62   4.26   2.34   5.91   4.65   12.41   10.73    6.92   10.11   18.65   10.51   15.10   13.41   15.07   17.56   20.83   17.05   16.81   23.54   29.12   32.23   36.42   30.89   22.91   33.12   39.50   47.66
## 2   1.93   3.26   2.45   1.92   4.06   2.01   2.27   6.61   6.55    5.42    5.76    5.71    3.65    6.55    6.52   10.95    7.20   11.13    7.16   17.41   16.56   16.35   20.21   14.15   19.80   12.96   12.05   15.69   40.81   29.15   48.28    0    1    1    1    0    1    1    1    1     1     0     0     0     0     0     1     1     1     1     0     0     0     1     0     1     0     1     1     1     1     1   2.93   6.26   4.45   2.92   8.06   5.01   5.27   9.61   8.55    6.42    6.76    7.71    4.65   10.55   10.52   16.95   11.20   18.13   10.16   24.41   23.56   23.35   27.21   20.15   25.80   17.96   17.05   19.69   43.81   31.15   49.28
## 3   2.93   1.77   2.38   2.22   3.29   3.93   3.16   2.36   5.92    2.51    3.98    5.41    7.65    8.40    6.91   11.61   10.77   17.68   11.76   10.51   10.07    9.65   14.99   17.15   18.05   16.85   21.61   46.14   33.15   43.76   40.10    1    0    1    0    1    1    0    1    0     1     0     0     1     0     1     0     0     1     1     0     0     1     1     1     1     1     0     0     0     0     1   4.93   2.77   4.38   4.22   6.29   5.93   5.16   4.36   7.92    3.51    4.98    7.41   11.65   12.40    9.91   18.61   17.77   25.68   17.76   15.51   16.07   15.65   19.99   21.15   21.05   18.85   22.61   47.14   34.15   44.76   41.10
## 4   1.30   2.33   2.28   2.26   2.14   4.53   2.43   8.69   2.47    2.22    4.95    6.17    5.59    4.61    8.52   10.94    8.59   15.78   10.07   15.32   10.36   23.15   10.73   20.32   18.99   24.20   19.67   23.73   17.06   28.68   22.02    0    1    0    0    1    0    1    0    1     1     1     1     0     1     0     1     0     1     0     1     1     0     0     1     0     0     1     0     0     0     0   2.30   3.33   3.28   3.26   4.14   7.53   4.43  14.69   5.47    5.22    7.95   10.17    8.59    7.61   12.52   15.94   12.59   20.78   14.07   19.32   13.36   25.15   12.73   22.32   19.99   25.20   20.67   23.73   17.06   28.68   22.02
## 5   1.78   2.01   2.62   3.41   3.99   4.59   4.40   2.04   5.61    2.61    4.62   11.14    3.82    3.98    6.19   10.00   11.32   12.49   11.20   10.57   30.62   10.36   18.64   26.65   13.70   25.38   47.55   48.51   22.14   65.28   31.64    0    0    1    0    0    1    1    0    0     0     0     1     0     1     0     1     0     1     1     0     1     1     0     0     0     1     0     1     1     1     1   1.78   3.01   3.62   5.41   5.99   6.59   5.40   2.04   7.61    3.61    6.62   18.14    5.82    5.98   10.19   15.00   17.32   20.49   18.20   16.57   37.62   16.36   23.64   31.65   18.70   30.38   51.55   52.51   25.14   67.28   32.64
## 6   1.90   2.10   1.32   2.55   3.17   3.66   5.37   4.61   4.05    3.15    5.14    6.32    7.09    5.21    8.47    8.49    3.59    6.44    7.73   11.83   12.47   18.67   10.56   18.36   13.59   36.21   13.16   26.77   30.47   39.61   45.86    0    0    0    1    1    1    0    1    0     1     0     1     1     1     1     1     0     0     0     0     1     1     1     1     1     0     1     1     0     0     1   1.90   3.10   2.32   5.55   6.17   5.66   8.37   7.61   7.05    6.15   10.14   11.32   11.09    8.21   13.47   13.49    3.59   10.44   12.73   19.83   20.47   25.67   16.56   23.36   17.59   39.21   16.16   28.77   31.47   40.61   46.86
##         Bday_1 Bday_2 Bday_3 Bday_4 Bday_5 Bday_6 Bday_7 Bday_8 Bday_9 Bday_10 Bday_11 Bday_12 Bday_13 Bday_14 Bday_15 Bday_16 Bday_17 Bday_18 Bday_19 Bday_20 Bday_21 Bday_22 Bday_23 Bday_24 Bday_25 Bday_26 Bday_27 Bday_28 Bday_29 Bday_30 Bday_31 HD_1 HD_2 HD_3 HD_4 HD_5 HD_6 HD_7 HD_8 HD_9 HD_10 HD_11 HD_12 HD_13 HD_14 HD_15 HD_16 HD_17 HD_18 HD_19 HD_20 HD_21 HD_22 HD_23 HD_24 HD_25 HD_26 HD_27 HD_28 HD_29 HD_30 HD_31 Cday_1 Cday_2 Cday_3 Cday_4 Cday_5 Cday_6 Cday_7 Cday_8 Cday_9 Cday_10 Cday_11 Cday_12 Cday_13 Cday_14 Cday_15 Cday_16 Cday_17 Cday_18 Cday_19 Cday_20 Cday_21 Cday_22 Cday_23 Cday_24 Cday_25 Cday_26 Cday_27 Cday_28 Cday_29 Cday_30 Cday_31
## 3699995   2.02   1.39   1.61   3.57   2.33   4.29   3.75   2.34   2.22    3.71    4.65    4.87    7.42    6.64    7.45    4.88    9.43    9.38    8.78    9.19   16.88   12.26   11.94   13.79   22.19   26.93   30.02   22.12   57.83   33.74   45.20    0    1    1    1    0    1    0    1    1     0     0     1     0     0     0     1     0     1     0     1     0     1     0     1     0     1     0     0     0     1     0   4.02   3.39   3.61   5.57   3.33   7.29   5.75   4.34   3.22    4.71    5.65    6.87   10.42    9.64   11.45    7.88   14.43   14.38   12.78   13.19   20.88   16.26   14.94   16.79   24.19   28.93   31.02   23.12   58.83   34.74   45.20
## 3699996   2.44   2.35   2.52   2.88   2.63   2.30   4.33   2.88   4.67    3.87    3.76    5.94    7.09    4.94    6.10    9.11   11.67   10.41   18.27   11.97   12.27   13.84   19.16   15.52   15.90   37.48   21.96   19.86   30.20   23.79   56.55    0    0    0    1    0    1    0    1    1     0     0     0     1     0     1     1     1     0     1     1     1     1     0     1     1     1     1     0     0     0     1   2.44   3.35   3.52   4.88   3.63   4.30   6.33   4.88   6.67    4.87    4.76    9.94   13.09    7.94   12.10   17.11   20.67   18.41   27.27   19.97   19.27   19.84   24.16   20.52   19.90   40.48   23.96   20.86   31.20   24.79   57.55
## 3699997   3.48   2.19   2.27   2.67   1.41   4.02   2.02   3.14   5.40    2.30    7.51    4.96    4.06    8.74    6.45   11.91    8.70    7.13   14.66    8.99   13.88   12.94   23.77   16.46   16.18   22.91   25.20   24.44   35.57   52.91   47.57    0    1    0    0    1    0    0    0    0     0     0     1     0     1     1     1     0     1     0     1     1     1     0     0     1     1     1     1     1     0     1   4.48   3.19   3.27   3.67   2.41   4.02   2.02   3.14   7.40    3.30   12.51    8.96    7.06   15.74   11.45   19.91   13.70   12.13   23.66   15.99   21.88   19.94   29.77   22.46   22.18   27.91   29.20   27.44   37.57   53.91   48.57
## 3699998   2.68   1.63   1.55   1.62   3.02   3.00   2.92   3.10   3.59    4.78    5.99    5.50    5.75   12.67    9.35    7.57    9.35    7.43    5.41    9.15   10.92   21.05   12.65   25.29   21.00   18.63   15.57   30.96   32.45   18.39   37.04    0    1    1    0    1    1    0    1    0     0     0     1     1     1     1     0     1     1     1     1     1     0     1     0     0     0     1     1     0     1     1   4.68   3.63   2.55   2.62   6.02   5.00   3.92   4.10   4.59    7.78    9.99   10.50   10.75   20.67   16.35   13.57   15.35   12.43    9.41   14.15   16.92   26.05   17.65   29.29   25.00   22.63   19.57   33.96   34.45   20.39   38.04
## 3699999   1.62   2.17   2.95   1.63   5.78   3.49   3.52   4.59   6.45    4.79    7.41    3.93    6.91   10.42    8.75    8.46    5.48    8.64   10.62   22.62   14.13   13.12   10.19   21.40   26.65   31.67   31.14   26.52   21.06   61.38   37.07    1    0    1    0    1    1    1    0    0     1     0     1     0     1     1     0     1     0     0     1     1     1     0     1     1     0     0     0     1     0     1   2.62   3.17   4.95   2.63   9.78   5.49   5.52   6.59  10.45    7.79   11.41    6.93    9.91   17.42   13.75   13.46    9.48   13.64   16.62   29.62   20.13   18.12   14.19   25.40   29.65   33.67   33.14   28.52   23.06   62.38   38.07
## 3700000   2.00   2.54   1.45   2.94   2.56   3.49   3.40   2.52   4.23    4.71   11.82    8.48    5.34    6.82   10.67   16.26    9.16   10.27   15.61   10.46   24.70   18.43   17.41   15.70   13.10   24.01   19.12   25.59   32.41   32.45   32.74    1    0    1    1    0    1    1    0    1     0     1     1     0     0     1     1     0     0     1     1     0     1     1     1     0     1     1     0     1     1     0   3.00   4.54   3.45   4.94   4.56   6.49   5.40   3.52   7.23    6.71   18.82   13.48    7.34   10.82   17.67   26.26   15.16   17.27   24.61   18.46   31.70   25.43   23.41   20.70   17.10   28.01   22.12   27.59   34.41   33.45   32.74

(Note that the first timing result is not important for your purposes, since that's just the generation of my random test data; I only included it for information's sake.)

As you can see, it only took about 70s to precompute hdcums, and then only 13s to perform the 31 vectorized calculations. A significant improvement over 24 hours!


I also looked into an Rcpp-based solution to see if we can improve performance further, by performing the entire computation in C++. As it turns out, we can. The below demo continues off from the above full-size performance test, copying df to df2 minus the output columns, and then recomputes the output columns on df2 using the C++ implementation function I named addBdayHDs(). I then use all.equal() to verify that df2 matches df, ignoring negligible floating-point discrepancies:

library(Rcpp);
cppFunction('
    NumericMatrix addBdayHDs(DataFrame df) {
        int NC = df.length()/2;
        NumericVector hdcums(NC+1);
        NumericMatrix Cday(df.nrows(),NC);
        for (int row = df.nrows()-1; row >= 0; --row) {
            hdcums[1] = as<NumericVector>(df(NC))[row];
            for (int col = 1; col < NC; ++col)
                hdcums[col+1] = hdcums[col]+as<NumericVector>(df(NC+col))[row];
            for (int col = NC-1; col >= 0; --col) {
                Cday(row,col) = as<NumericVector>(df(col))[row];
                Cday(row,col) += hdcums[std::min(NC,col+(int)std::ceil(Cday(row,col)))] - hdcums[col];
            }
        }
        return Cday;
    }
');
df2 <- df[seq_len(NC*2L)];
system.time({ df2[,paste0('Cday_',seq_len(NC))] <- addBdayHDs(df2); });
##    user  system elapsed
##  30.688   1.547  32.310
all.equal(df,df2);
## [1] TRUE

Upvotes: 2

Related Questions