Reputation: 65
I have written the loop below, which gives me the results I need, but the processing time is very long. I need to analyze a large amount of data (400,000+ rows, ideally 25,000,000+), so I would like to know whether there is any way to speed up the calculation below (a snippet of the data):
My dataframe is called: crsp.comp3
Permno Observation C.xsgaq C.xsgaq.depr
10026 1 45.145 44.393
10026 2 45.145 43.653
10026 3 45.145 42.925
10026 4 96.730 92.935
10026 5 96.730 91.386
10026 6 96.730 89.863
10026 7 145.511 136.333
10026 8 145.511 134.061
10026 9 145.511 131.827
10026 10 190.986 174.347
Currently, I calculate the numbers in the 'C.xsgaq.depr' column as:
for (i in 1:nrow(crsp.comp3)) {
  if (crsp.comp3[i, 2] == 1) {
    # first observation of a group: depreciate the opening balance once
    crsp.comp3[i, 4] <- crsp.comp3[i, 3] * (1 - (0.2/12))
  } else {
    # later observations: carry forward the previous depreciated value,
    # add the change in C.xsgaq, and depreciate again
    crsp.comp3[i, 4] <- (crsp.comp3[i - 1, 4] +
      (crsp.comp3[i, 3] - crsp.comp3[i - 1, 3])) * (1 - (0.2/12))
  }
}
Observations numbered 1 need to be calculated as in the first branch, and all observations != 1 as in the second branch of the loop above. My objective is to optimize the code so it runs faster. I have heard something about vectorizing the data frame?
Thank you
Upvotes: 1
Views: 114
Reputation: 26218
Another way of doing it iteratively is through purrr::accumulate():
library(purrr)

# add a column for the asset addition made mid-way through the year
crsp.comp3$assetadd <- c(0, diff(crsp.comp3$C.xsgaq))

# create your new desired column iteratively
accumulate(crsp.comp3$assetadd, ~ (.x + .y) * (11.8/12), .init = crsp.comp3$C.xsgaq[1])[-1]
[1] 44.39258 43.65271 42.92516 92.93499 91.38608 89.86297 136.33324 134.06102 131.82667 174.34664
# or store it in a new column directly
crsp.comp3$desired_val <- accumulate(crsp.comp3$assetadd, ~ (.x + .y) * (11.8/12), .init = crsp.comp3$C.xsgaq[1])[-1]
#check it
> crsp.comp3
Permno Observation C.xsgaq C.xsgaq.depr assetadd desired_val
1 10026 1 45.145 44.393 0.000 44.39258
2 10026 2 45.145 43.653 0.000 43.65271
3 10026 3 45.145 42.925 0.000 42.92516
4 10026 4 96.730 92.935 51.585 92.93499
5 10026 5 96.730 91.386 0.000 91.38608
6 10026 6 96.730 89.863 0.000 89.86297
7 10026 7 145.511 136.333 48.781 136.33324
8 10026 8 145.511 134.061 0.000 134.06102
9 10026 9 145.511 131.827 0.000 131.82667
10 10026 10 190.986 174.347 45.475 174.34664
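The snippet above assumes a single Permno. If the full data set stacks many Permno groups (as the question suggests), one way to restart the accumulation at the start of each group — just a sketch, assuming the data is sorted by Permno and Observation — is to wrap the same call in a grouped mutate:
library(dplyr)
library(purrr)

crsp.comp3 <- crsp.comp3 %>%
  group_by(Permno) %>%
  mutate(assetadd    = c(0, diff(C.xsgaq)),                    # change in C.xsgaq within the group
         desired_val = accumulate(assetadd, ~ (.x + .y) * (11.8/12),
                                  .init = first(C.xsgaq))[-1]) %>%
  ungroup()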
Data used:
crsp.comp3 <- structure(list(Permno = c(10026L, 10026L, 10026L, 10026L, 10026L,
10026L, 10026L, 10026L, 10026L, 10026L), Observation = 1:10,
C.xsgaq = c(45.145, 45.145, 45.145, 96.73, 96.73, 96.73,
145.511, 145.511, 145.511, 190.986), C.xsgaq.depr = c(44.393,
43.653, 42.925, 92.935, 91.386, 89.863, 136.333, 134.061,
131.827, 174.347)), class = "data.frame", row.names = c(NA,
-10L))
> crsp.comp3
Permno Observation C.xsgaq C.xsgaq.depr
1 10026 1 45.145 44.393
2 10026 2 45.145 43.653
3 10026 3 45.145 42.925
4 10026 4 96.730 92.935
5 10026 5 96.730 91.386
6 10026 6 96.730 89.863
7 10026 7 145.511 136.333
8 10026 8 145.511 134.061
9 10026 9 145.511 131.827
10 10026 10 190.986 174.347
Upvotes: 1
Reputation: 66425
Performance comparison: this method is about 150x faster (8.1 sec -> 0.05 sec) for 10k observations and about 1,000x faster (166 sec -> 0.15 sec) for 100k observations. I expect the performance gap to grow even larger on bigger data. See the bottom for the fake data used for testing.
This can handle 25M rows of fake data in about 5 seconds. If you need it faster still, I suggest using data.table; a rough sketch of how the same idea might translate is included after the result below.
Here's an alternative using dplyr (not because it offers any particular advantage over base R here, but because it's easier for me), relying on some algebraic manipulation. (Open to suggestions for further simplification!)
The key to making R fast is to structure your problem so that the same calculation is applied across all elements of your data at once. That's vectorization.
library(dplyr)

crsp.comp3 %>%
  # The grouping here will make it so that the first Obs
  # in a new group won't "see" the last Obs of the prior group.
  # We could just as easily group by Permno...
  mutate(group = cumsum(Observation == 1)) %>%
  group_by(group) %>%
  mutate(deprec = (11.8/12) ^ Observation,
         C_change = (C.xsgaq - lag(C.xsgaq, default = 0)) /
                    lag(deprec, default = 1),  # Edit: a little faster than
                                               # (11.8/12)^(Observation - 1)
         cuml = cumsum(C_change),
         output = cuml * deprec) %>%
  ungroup()
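The algebra that makes this work: writing r = 11.8/12, the original loop is the recursion d[i] = (d[i-1] + (C[i] - C[i-1])) * r with d[1] = C[1] * r. Unrolling it gives d[i] = r^i * sum over j <= i of (C[j] - C[j-1]) / r^(j-1) (taking C[0] = 0), which is exactly cuml * deprec above: each change in C.xsgaq is divided by the depreciation factor in force just before it arrived, the rescaled changes are cumulatively summed, and the running total is then depreciated i periods in one vectorized step.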
Result
# A tibble: 10 x 8
Permno Observation C.xsgaq group deprec C_change cuml output
<int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 10026 1 45.1 1 0.983 45.1 45.1 44.4
2 10026 2 45.1 1 0.967 0 45.1 43.7
3 10026 3 45.1 1 0.951 0 45.1 42.9
4 10026 4 96.7 1 0.935 54.3 99.4 92.9
5 10026 5 96.7 1 0.919 0 99.4 91.4
6 10026 6 96.7 1 0.904 0 99.4 89.9
7 10026 7 146. 1 0.889 54.0 153. 136.
8 10026 8 146. 1 0.874 0 153. 134.
9 10026 9 146. 1 0.860 0 153. 132.
10 10026 10 191. 1 0.845 52.9 206. 174.
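For reference, here is a rough sketch of how the same algebra might translate to data.table (not benchmarked here, so treat the column names and steps as an illustration rather than a measured implementation):
library(data.table)

dt <- as.data.table(crsp.comp3)
dt[, group := cumsum(Observation == 1)]                # same group marker as above
dt[, deprec := (11.8/12) ^ Observation]                # cumulative depreciation factor
dt[, C_change := (C.xsgaq - shift(C.xsgaq, fill = 0)) /
                 shift(deprec, fill = 1), by = group]  # rescaled change within each group
dt[, output := cumsum(C_change) * deprec, by = group]  # depreciated running total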
Fake data used for testing:
n <- 100000
Permno <- 1000
Obs <- floor(n / Permno)

crsp.comp3 <- tibble(Permno = rep(1:Permno, each = Obs),
                     Observation = rep(1:Obs, length.out = n),
                     Chg = sample(c(rep(0, 10), runif(5, 1, 100)), n, replace = TRUE)) %>%
  group_by(Permno) %>%
  mutate(C.xsgaq = cumsum(Chg)) %>%
  ungroup() %>%
  select(Permno, Observation, C.xsgaq)
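If you want to reproduce the comparison yourself, a rough timing sketch on the fake data above might look like this (an assumed setup, not the exact benchmark behind the numbers quoted earlier; the loop is rewritten with $ indexing and a pre-allocated output column so it isn't penalized for growing the data frame):
library(dplyr)

loop_version <- function(df) {
  df$C.xsgaq.depr <- NA_real_                 # pre-allocate the output column
  for (i in 1:nrow(df)) {
    if (df$Observation[i] == 1) {
      df$C.xsgaq.depr[i] <- df$C.xsgaq[i] * (1 - (0.2/12))
    } else {
      df$C.xsgaq.depr[i] <- (df$C.xsgaq.depr[i - 1] +
        (df$C.xsgaq[i] - df$C.xsgaq[i - 1])) * (1 - (0.2/12))
    }
  }
  df
}

vectorized_version <- function(df) {
  df %>%
    mutate(group = cumsum(Observation == 1)) %>%
    group_by(group) %>%
    mutate(deprec = (11.8/12) ^ Observation,
           C_change = (C.xsgaq - lag(C.xsgaq, default = 0)) / lag(deprec, default = 1),
           output = cumsum(C_change) * deprec) %>%
    ungroup()
}

system.time(res_loop <- loop_version(crsp.comp3))
system.time(res_vec  <- vectorized_version(crsp.comp3))

# sanity check: both approaches should agree
all.equal(res_loop$C.xsgaq.depr, res_vec$output)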
Upvotes: 4