Reputation: 65
I have written the loop below, which gives me the results I need, but the processing time is very long. I need to analyze a large amount of data (400,000+ rows, ideally 25,000,000+), so I would like to know whether there is any way to speed up the calculation below (a snippet of the data):
My dataframe is called: crsp.comp3
Permno Observation C.xsgaq C.xsgaq.depr
10026 1 45.145 44.393
10026 2 45.145 43.653
10026 3 45.145 42.925
10026 4 96.730 92.935
10026 5 96.730 91.386
10026 6 96.730 89.863
10026 7 145.511 136.333
10026 8 145.511 134.061
10026 9 145.511 131.827
10026 10 190.986 174.347
Currently, I calculate the numbers in the 'C.xsgaq.depr' column as:
for (i in 1:nrow(crsp.comp3)) {
  if (crsp.comp3[i, 2] == 1) {
    # first observation of a group: depreciate the opening balance once
    crsp.comp3[i, 4] <- crsp.comp3[i, 3] * (1 - (0.2/12))
  } else {
    # later observations: carry forward the previous depreciated value,
    # add the change in C.xsgaq, and depreciate again
    crsp.comp3[i, 4] <- (crsp.comp3[i - 1, 4] +
      (crsp.comp3[i, 3] - crsp.comp3[i - 1, 3])) * (1 - (0.2/12))
  }
}
Observations numbered 1 need to be calculated as in the first branch, and all observations != 1 as in the second branch of the loop above. My objective is to optimize the code so it runs faster. I have heard something about vectorizing the data frame?
Thank you
Upvotes: 1
Views: 114
Reputation: 26218
Another way of doing it iteratively is through purrr::accumulate():
library(purrr)

# add a column for the asset addition made mid-way through the year
crsp.comp3$assetadd <- c(0, diff(crsp.comp3$C.xsgaq))

# create your new desired column iteratively
accumulate(crsp.comp3$assetadd, ~ (.x + .y) * (11.8/12), .init = crsp.comp3$C.xsgaq[1])[-1]
[1] 44.39258 43.65271 42.92516 92.93499 91.38608 89.86297 136.33324 134.06102 131.82667 174.34664
# or store it in a new column directly
crsp.comp3$desired_val <- accumulate(crsp.comp3$assetadd, ~ (.x + .y) * (11.8/12), .init = crsp.comp3$C.xsgaq[1])[-1]
#check it
> crsp.comp3
Permno Observation C.xsgaq C.xsgaq.depr assetadd desired_val
1 10026 1 45.145 44.393 0.000 44.39258
2 10026 2 45.145 43.653 0.000 43.65271
3 10026 3 45.145 42.925 0.000 42.92516
4 10026 4 96.730 92.935 51.585 92.93499
5 10026 5 96.730 91.386 0.000 91.38608
6 10026 6 96.730 89.863 0.000 89.86297
7 10026 7 145.511 136.333 48.781 136.33324
8 10026 8 145.511 134.061 0.000 134.06102
9 10026 9 145.511 131.827 0.000 131.82667
10 10026 10 190.986 174.347 45.475 174.34664
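The snippet above assumes a single Permno. If the full data set stacks many Permno groups (as the question suggests), one way to restart the accumulation at the start of each group — just a sketch, assuming the data is sorted by Permno and Observation — is to wrap the same call in a grouped mutate:
library(dplyr)
library(purrr)

crsp.comp3 <- crsp.comp3 %>%
  group_by(Permno) %>%
  mutate(assetadd    = c(0, diff(C.xsgaq)),                    # change in C.xsgaq within the group
         desired_val = accumulate(assetadd, ~ (.x + .y) * (11.8/12),
                                  .init = first(C.xsgaq))[-1]) %>%
  ungroup()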
Data used:
crsp.comp3 <- structure(list(Permno = c(10026L, 10026L, 10026L, 10026L, 10026L,
10026L, 10026L, 10026L, 10026L, 10026L), Observation = 1:10,
C.xsgaq = c(45.145, 45.145, 45.145, 96.73, 96.73, 96.73,
145.511, 145.511, 145.511, 190.986), C.xsgaq.depr = c(44.393,
43.653, 42.925, 92.935, 91.386, 89.863, 136.333, 134.061,
131.827, 174.347)), class = "data.frame", row.names = c(NA,
-10L))
> crsp.comp3
Permno Observation C.xsgaq C.xsgaq.depr
1 10026 1 45.145 44.393
2 10026 2 45.145 43.653
3 10026 3 45.145 42.925
4 10026 4 96.730 92.935
5 10026 5 96.730 91.386
6 10026 6 96.730 89.863
7 10026 7 145.511 136.333
8 10026 8 145.511 134.061
9 10026 9 145.511 131.827
10 10026 10 190.986 174.347
Upvotes: 1
Reputation: 66425
Performance comparison: this method is about 150x faster (8.1 sec -> 0.05 sec) for 10k observations and about 1,000x faster (166 sec -> 0.15 sec) for 100k observations. I expect the performance gap to grow even larger on bigger data. See the bottom for the fake data used for testing.
This can handle 25M rows of fake data in about 5 seconds. If you need it faster still, I suggest using data.table; a rough sketch of how the same idea might translate is included after the result below.
Here's an alternative using dplyr (not because it offers any particular advantage over base R here, but because it's easier for me), relying on some algebraic manipulation. (Open to suggestions for further simplification!)
The key to making R fast is to structure your problem so that the same calculation is applied across all elements of your data at once. That's vectorization.
library(dplyr)

crsp.comp3 %>%
  # The grouping here will make it so that the first Obs
  # in a new group won't "see" the last Obs of the prior group.
  # We could just as easily group by Permno...
  mutate(group = cumsum(Observation == 1)) %>%
  group_by(group) %>%
  mutate(deprec = (11.8/12) ^ Observation,
         C_change = (C.xsgaq - lag(C.xsgaq, default = 0)) /
                    lag(deprec, default = 1),  # Edit: a little faster than
                                               # (11.8/12)^(Observation - 1)
         cuml = cumsum(C_change),
         output = cuml * deprec) %>%
  ungroup()
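The algebra that makes this work: writing r = 11.8/12, the original loop is the recursion d[i] = (d[i-1] + (C[i] - C[i-1])) * r with d[1] = C[1] * r. Unrolling it gives d[i] = r^i * sum over j <= i of (C[j] - C[j-1]) / r^(j-1) (taking C[0] = 0), which is exactly cuml * deprec above: each change in C.xsgaq is divided by the depreciation factor in force just before it arrived, the rescaled changes are cumulatively summed, and the running total is then depreciated i periods in one vectorized step.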
Result
# A tibble: 10 x 8
Permno Observation C.xsgaq group deprec C_change cuml output
<int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 10026 1 45.1 1 0.983 45.1 45.1 44.4
2 10026 2 45.1 1 0.967 0 45.1 43.7
3 10026 3 45.1 1 0.951 0 45.1 42.9
4 10026 4 96.7 1 0.935 54.3 99.4 92.9
5 10026 5 96.7 1 0.919 0 99.4 91.4
6 10026 6 96.7 1 0.904 0 99.4 89.9
7 10026 7 146. 1 0.889 54.0 153. 136.
8 10026 8 146. 1 0.874 0 153. 134.
9 10026 9 146. 1 0.860 0 153. 132.
10 10026 10 191. 1 0.845 52.9 206. 174.
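For reference, here is a rough sketch of how the same algebra might translate to data.table (not benchmarked here, so treat the column names and steps as an illustration rather than a measured implementation):
library(data.table)

dt <- as.data.table(crsp.comp3)
dt[, group := cumsum(Observation == 1)]                # same group marker as above
dt[, deprec := (11.8/12) ^ Observation]                # cumulative depreciation factor
dt[, C_change := (C.xsgaq - shift(C.xsgaq, fill = 0)) /
                 shift(deprec, fill = 1), by = group]  # rescaled change within each group
dt[, output := cumsum(C_change) * deprec, by = group]  # depreciated running total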
Fake data used for testing:
n <- 100000
Permno <- 1000
Obs <- floor(n / Permno)

crsp.comp3 <- tibble(Permno = rep(1:Permno, each = Obs),
                     Observation = rep(1:Obs, length.out = n),
                     Chg = sample(c(rep(0, 10), runif(5, 1, 100)), n, replace = TRUE)) %>%
  group_by(Permno) %>%
  mutate(C.xsgaq = cumsum(Chg)) %>%
  ungroup() %>%
  select(Permno, Observation, C.xsgaq)
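If you want to reproduce the comparison yourself, a rough timing sketch on the fake data above might look like this (an assumed setup, not the exact benchmark behind the numbers quoted earlier; the loop is rewritten with $ indexing and a pre-allocated output column so it isn't penalized for growing the data frame):
library(dplyr)

loop_version <- function(df) {
  df$C.xsgaq.depr <- NA_real_                 # pre-allocate the output column
  for (i in 1:nrow(df)) {
    if (df$Observation[i] == 1) {
      df$C.xsgaq.depr[i] <- df$C.xsgaq[i] * (1 - (0.2/12))
    } else {
      df$C.xsgaq.depr[i] <- (df$C.xsgaq.depr[i - 1] +
        (df$C.xsgaq[i] - df$C.xsgaq[i - 1])) * (1 - (0.2/12))
    }
  }
  df
}

vectorized_version <- function(df) {
  df %>%
    mutate(group = cumsum(Observation == 1)) %>%
    group_by(group) %>%
    mutate(deprec = (11.8/12) ^ Observation,
           C_change = (C.xsgaq - lag(C.xsgaq, default = 0)) / lag(deprec, default = 1),
           output = cumsum(C_change) * deprec) %>%
    ungroup()
}

system.time(res_loop <- loop_version(crsp.comp3))
system.time(res_vec  <- vectorized_version(crsp.comp3))

# sanity check: both approaches should agree
all.equal(res_loop$C.xsgaq.depr, res_vec$output)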
Upvotes: 4