Reputation: 1472
I'm trying to apply a rolling function to a data.table by group and am running into problems. I'm not sure whether I should change my function or whether my call is wrong. Here is a simple example:
Data:
library(data.table)
library(zoo)

test <- data.table(return = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 0.2),
                   sec = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"))
My function:
zoo_fun <- function(dt, N) {
  (rollapply(dt$return + 1, N, FUN = prod, fill = NA, align = 'right') - 1)
}
Running it (I want to create a new column momentum3, which for each security is the product of one plus each of the latest 3 observations, minus one, so grouping by sec):
test[, momentum3 := zoo_fun(test, 3), by=sec]
Warning messages:
1: In `[.data.table`(test, , `:=`(momentum3, zoo_fun(test, 3)), by = sec) :
RHS 1 is length 10 (greater than the size (5) of group 1). The last 5 element(s) will be discarded.
2: In `[.data.table`(test, , `:=`(momentum3, zoo_fun(test, 3)), by = sec) :
RHS 1 is length 10 (greater than the size (5) of group 2). The last 5 element(s) will be discarded.
I get those warnings, and the result is not what I expected:
> test
return sec momentum3
1: 0.1 A NA
2: 0.1 A NA
3: 0.1 A 0.331
4: 0.1 A 0.331
5: 0.1 A 0.331
6: 0.2 B NA
7: 0.2 B NA
8: 0.2 B 0.331
9: 0.2 B 0.331
10: 0.2 B 0.331
I was expecting sec B to be filled with 0.728 ((1.2 * 1.2 * 1.2) - 1), with two NAs at the start. What am I doing wrong? Is it that rolling functions won't work with grouping? For reference, the expected value for B checks out directly:
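prod(1 + c(0.2, 0.2, 0.2)) - 1
# [1] 0.728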
Upvotes: 1
Views: 2934
Reputation: 42592
This answer suggested using Reduce() and shift() for rolling window problems with data.table. This benchmark showed that this can be considerably faster than zoo::rollapply().
test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
# return sec momentum
# 1: 0.1 A NA
# 2: 0.1 A NA
# 3: 0.1 A 0.331
# 4: 0.1 A 0.331
# 5: 0.1 A 0.331
# 6: 0.2 B NA
# 7: 0.2 B NA
# 8: 0.2 B 0.728
# 9: 0.2 B 0.728
#10: 0.2 B 0.728
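To see the mechanics: shift(x, 0:2) returns a list holding x together with its first two lags, and Reduce(`*`, ...) multiplies these vectors elementwise, so each position ends up with the product of the current and the two preceding values. A minimal illustration on a plain vector (the B group's gross returns):

library(data.table)
x <- 1 + c(0.2, 0.2, 0.2, 0.2, 0.2)
# list of x, x lagged by 1, and x lagged by 2 (NA where a lag runs off the start)
shift(x, 0:2, type = "lag")
# elementwise product across the three vectors, minus one
Reduce(`*`, shift(x, 0:2, type = "lag")) - 1
# [1]    NA    NA 0.728 0.728 0.728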
Benchmarking both approaches on the OP's small data set:
microbenchmark::microbenchmark(
zoo = test[, momentum := zoo_fun(return, 3), by = sec][],
red = test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][],
times = 100L
)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# zoo 2318.209 2389.131 2445.1707 2421.541 2466.1930 3108.382 100 b
# red 562.465 625.413 663.4893 646.880 673.4715 1094.771 100 a
To verify that the benchmark results from the small data set carry over, a larger data set is constructed:
n_rows <- 1e4
test0 <- data.table(return = rep(as.vector(outer(1:5/100, 1:2/10, "+")), n_rows),
                    sec = rep(rep(c("A", "B"), each = 5L), n_rows))
test0
# return sec
# 1: 0.11 A
# 2: 0.12 A
# 3: 0.13 A
# 4: 0.14 A
# 5: 0.15 A
# ---
# 99996: 0.21 B
# 99997: 0.22 B
# 99998: 0.23 B
# 99999: 0.24 B
#100000: 0.25 B
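The return values are built with outer(), which forms all pairwise sums of 1:5/100 and 1:2/10; flattened column-wise, this gives the five A returns followed by the five B returns:

as.vector(outer(1:5/100, 1:2/10, "+"))
# [1] 0.11 0.12 0.13 0.14 0.15 0.21 0.22 0.23 0.24 0.25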
As test is being modified in place, each benchmark run starts with a fresh copy of test0.
microbenchmark::microbenchmark(
copy = test <- copy(test0),
zoo = {
test <- copy(test0)
test[, momentum := zoo_fun(return, 3), by = sec][]
},
red = {
test <- copy(test0)
test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
},
times = 10L
)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# copy 282.619 294.512 325.3261 298.424 350.272 414.983 10 a
# zoo 1129601.974 1144346.463 1188484.0653 1162598.499 1194430.395 1337727.279 10 b
# red 3354.554 3439.095 6135.8794 5002.008 7695.948 11443.595 10 a
For 100k rows, the Reduce()/shift() approach is more than 200 times faster than zoo::rollapply().
Apparently, there are different interpretations of what the expected result is.
To investigate this, a modified data set is used:
test <- data.table(return = c(0.1, 0.11, 0.12, 0.13, 0.14, 0.21, 0.22, 0.23, 0.24, 0.25),
                   sec = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"))
test
# return sec
# 1: 0.10 A
# 2: 0.11 A
# 3: 0.12 A
# 4: 0.13 A
# 5: 0.14 A
# 6: 0.21 B
# 7: 0.22 B
# 8: 0.23 B
# 9: 0.24 B
#10: 0.25 B
Note that the return values within each group vary, unlike the OP's data set where the return values within each sec group are constant.
With this, the accepted answer (rollapply()) returns:
test[, momentum := zoo_fun(return, 3), by = sec][]
# return sec momentum
# 1: 0.10 A NA
# 2: 0.11 A NA
# 3: 0.12 A 0.367520
# 4: 0.13 A 0.404816
# 5: 0.14 A 0.442784
# 6: 0.21 B NA
# 7: 0.22 B NA
# 8: 0.23 B 0.815726
# 9: 0.24 B 0.860744
#10: 0.25 B 0.906500
Henrik's answer computes the product over only the last three rows of each group and assigns that single value to those rows:
test[test[ , tail(.I, 3), by = sec]$V1, res := prod(return + 1) - 1, by = sec][]
# return sec res
# 1: 0.10 A NA
# 2: 0.11 A NA
# 3: 0.12 A 0.442784
# 4: 0.13 A 0.442784
# 5: 0.14 A 0.442784
# 6: 0.21 B NA
# 7: 0.22 B NA
# 8: 0.23 B 0.906500
# 9: 0.24 B 0.906500
#10: 0.25 B 0.906500
The Reduce()/shift() solution returns:
test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
# return sec momentum
# 1: 0.10 A NA
# 2: 0.11 A NA
# 3: 0.12 A 0.367520
# 4: 0.13 A 0.404816
# 5: 0.14 A 0.442784
# 6: 0.21 B NA
# 7: 0.22 B NA
# 8: 0.23 B 0.815726
# 9: 0.24 B 0.860744
#10: 0.25 B 0.906500
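As a cross-check, the two rolling results can be compared programmatically; momentum2 below is just an illustrative name for a second column holding the rollapply() result:

test[, momentum2 := zoo_fun(return, 3), by = sec]
all.equal(test$momentum, test$momentum2)
# [1] TRUE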
Upvotes: 5
Reputation: 37889
When you use dt$return, the whole data.table is picked up internally within each group rather than just that group's rows. Just pass the column you need to the function and it will work fine:
# use the column instead of the whole data.table
zoo_fun <- function(column, N) {
  (rollapply(column + 1, N, FUN = prod, fill = NA, align = 'right') - 1)
}
#now it works fine
test[, momentum := zoo_fun(return, 3), by = sec]
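To see the length mismatch behind the original warnings, compare what each group sees: the bare column name yields only that group's rows, while test$return always yields the full column (group_len and full_len are just illustrative names):

test[, .(group_len = length(return), full_len = length(test$return)), by = sec]
#    sec group_len full_len
# 1:   A         5       10
# 2:   B         5       10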
As a separate note, you should probably not use return as a column or variable name, since it clashes with base R's return().
Output:
> test
return sec momentum
1: 0.1 A NA
2: 0.1 A NA
3: 0.1 A 0.331
4: 0.1 A 0.331
5: 0.1 A 0.331
6: 0.2 B NA
7: 0.2 B NA
8: 0.2 B 0.728
9: 0.2 B 0.728
10: 0.2 B 0.728
Upvotes: 3