Mean of last year (halfyear, month) of observations using.data.table

Question

I would like to calculate average volume for all tickers in the dataset for previous 12, 6 and 3 months from daily data. Here is the sample data:

library(BatchGetSymbols)
sample <- BatchGetSymbols(tickers = c('AAPL', 'AMZN'), first.date = Sys.Date() - 500)
sample <- sample$df.tickers
sample <- sample[, c('ticker', 'ref.date', 'volume')]

For example, if the date is 2020-05-15, I would like to calculate 3-month average volume (from 2020-02-15 to 2020-05-15). The important thing is that daily data is irregular so maybe first date is not 2020-02-15 but , for example 2020-02-14.

r2evans · Accepted Answer

Here's an approach that uses the rolling-time concept.

I don't happen to have that package installed, so I'll use the ExampleData.rds as provided in the package (which I downloaded from https://github.com/cran/BatchGetSymbols/tree/master/inst/extdata). I'll subset it to just the first two tickers. (I do this for simplicity/demonstration, not because it is required. I also ran this code on the full dataset with all 15 tickers and it took less than 0.03 seconds for the two past periods.)

The data includes just 2014 data, so I'll also reduce your periods to 1 and 3 months, realizing that you can add an arbitrary number of periods to include.

SAMP <- readRDS("ExampleData.rds")
library(data.table)
setDT(SAMP)
unique(SAMP$ticker)[1:2]
SAMP2 <- SAMP[ticker %in% unique(ticker)[1:2], .(ticker, ref.date, volume)]
SAMP2
#        ticker   ref.date   volume
#   1: ABEV3.SA 2014-01-02  8036139
#   2: ABEV3.SA 2014-01-03 24922793
#   3: ABEV3.SA 2014-01-06  9355961
#   4: ABEV3.SA 2014-01-07 18755025
#   5: ABEV3.SA 2014-01-08 11446953
#  ---                             
# 492: BBAS3.SA 2014-12-22  3222300
# 493: BBAS3.SA 2014-12-23  3234100
# 494: BBAS3.SA 2014-12-26  1553400
# 495: BBAS3.SA 2014-12-29  1984000
# 496: BBAS3.SA 2014-12-30  2800100

I use magrittr just to break things out as a pipe, it is not required. I also add lubridate in order to use add_with_rollback (aka %m-) in order to smartly look back n months without day-subtractions creating odd effects.

library(magrittr)                      # %>%
library(lubridate)                     # %m-% or add_with_rollback, months
past <- c(1, 3)                        # change this to your c(3, 6, 12)
names(past) <- paste0("months", past)

This operates on a self-join using non-equi (or range) joins. Due to some nuances with how that is done (and the state of column names), my technique is to copy one of the reference columns (ref.date); after a range-join, the LHS column of a range has its name retained but the values assigned from the RHS. Regardless of the coding intent of that under the hood, I find it can be confusing, so I take a short-term performance hit on duplicating one column and removing it later. Since it does not affect the other columns, it does not detract much from the overall efficiency of using data.table (and its referential semantics).

I also create a calculated column (for each period), called past.date. I then use the logic of past.date <= ref.date <= present for the join (if only data.table could infer correctly from that more-succinct depiction :-).

SAMP2[, present := ref.date]
newdats <- Map(function(nm, P) {
  SAMP2[, past.date := add_with_rollback(ref.date, months(-P)) ] %>%
    SAMP2[., on = .(ticker == ticker, past.date <= ref.date, present >= ref.date) ] %>%
    .[, setNames(.(mean(i.volume)), nm), by = .(ticker, ref.date) ]
}, names(past), past)
SAMP2[, c("present", "past.date") := NULL ] # clean up extra columns
out <- Reduce(function(a,b) merge(a, b, by = c("ticker", "ref.date"), all.x = TRUE), newdats, init = SAMP2)
out
#        ticker   ref.date   volume  months1  months3
#   1: ABEV3.SA 2014-01-02  8036139  8036139  8036139
#   2: ABEV3.SA 2014-01-03 24922793 16479466 16479466
#   3: ABEV3.SA 2014-01-06  9355961 14104964 14104964
#   4: ABEV3.SA 2014-01-07 18755025 15267480 15267480
#   5: ABEV3.SA 2014-01-08 11446953 14503374 14503374
#  ---                                               
# 492: BBAS3.SA 2014-12-22  3222300 25233937 16780737
# 493: BBAS3.SA 2014-12-23  3234100 24233945 16626387
# 494: BBAS3.SA 2014-12-26  1553400 24653742 16681359
# 495: BBAS3.SA 2014-12-29  1984000 26420778 16543630
# 496: BBAS3.SA 2014-12-30  2800100 25239744 16368576

To prove that that's doing what we expect, I'll demonstrate two things:

I'll interrupt the Map on the first iteration, looking at the data immediately after the self-join (SAMP2[., on=.(...)]) but before the summarization. (I'll also order it to highlight what is present. This step is not required for proper calc/join.)

SAMP2[, past.date := add_with_rollback(ref.date, months(-P)) ]
#        ticker   ref.date   volume    present  past.date
#   1: ABEV3.SA 2014-01-02  8036139 2014-01-02 2013-12-02
#   2: ABEV3.SA 2014-01-03 24922793 2014-01-03 2013-12-03
#   3: ABEV3.SA 2014-01-06  9355961 2014-01-06 2013-12-06
#   4: ABEV3.SA 2014-01-07 18755025 2014-01-07 2013-12-07
#   5: ABEV3.SA 2014-01-08 11446953 2014-01-08 2013-12-08
#  ---                                                   
# 492: BBAS3.SA 2014-12-22  3222300 2014-12-22 2014-11-22
# 493: BBAS3.SA 2014-12-23  3234100 2014-12-23 2014-11-23
# 494: BBAS3.SA 2014-12-26  1553400 2014-12-26 2014-11-26
# 495: BBAS3.SA 2014-12-29  1984000 2014-12-29 2014-11-29
# 496: BBAS3.SA 2014-12-30  2800100 2014-12-30 2014-11-30

SAMP2[, past.date := add_with_rollback(ref.date, months(-P)) ] %>%
    SAMP2[., on = .(ticker == ticker, past.date <= ref.date, present >= ref.date) ] %>%
    .[ order(ticker, ref.date), ]
#          ticker   ref.date   volume    present  past.date i.volume  i.present i.past.date
#     1: ABEV3.SA 2014-01-02  8036139 2014-01-02 2014-01-02  8036139 2014-01-02  2013-12-02
#     2: ABEV3.SA 2014-01-03 24922793 2014-01-02 2014-01-02  8036139 2014-01-02  2013-12-02
#     3: ABEV3.SA 2014-01-03 24922793 2014-01-03 2014-01-03 24922793 2014-01-03  2013-12-03
#     4: ABEV3.SA 2014-01-06  9355961 2014-01-02 2014-01-02  8036139 2014-01-02  2013-12-02
#     5: ABEV3.SA 2014-01-06  9355961 2014-01-03 2014-01-03 24922793 2014-01-03  2013-12-03
#    ---                                                                                   
# 10326: BBAS3.SA 2014-12-30  2800100 2014-12-22 2014-12-22  3222300 2014-12-22  2014-11-22
# 10327: BBAS3.SA 2014-12-30  2800100 2014-12-23 2014-12-23  3234100 2014-12-23  2014-11-23
# 10328: BBAS3.SA 2014-12-30  2800100 2014-12-26 2014-12-26  1553400 2014-12-26  2014-11-26
# 10329: BBAS3.SA 2014-12-30  2800100 2014-12-29 2014-12-29  1984000 2014-12-29  2014-11-29
# 10330: BBAS3.SA 2014-12-30  2800100 2014-12-30 2014-12-30  2800100 2014-12-30  2014-11-30

Note that the first ref.date (2014-01-02) has 1 occurrence (no surprise, no data before 2014 in this set), 2014-01-03 has two rows (02 and 03), etc.

Breaking out of that, I'll change the flow to add the length of data used for each aggregation.

newdats <- Map(function(nm, P) {
  SAMP2[, past.date := add_with_rollback(ref.date, months(-P)) ] %>%
    SAMP2[., on = .(ticker == ticker, past.date <= ref.date, present >= ref.date) ] %>%
    .[, setNames(.(mean(i.volume), .N), c(nm, paste0(nm, "_n"))), by = .(ticker, ref.date) ]
}, names(past), past)
SAMP2[, c("present", "past.date") := NULL ]
out <- Reduce(function(a,b) merge(a, b, by = c("ticker", "ref.date"), all.x = TRUE), newdats, init = SAMP2)
out
#        ticker   ref.date   volume  months1 months1_n  months3 months3_n
#   1: ABEV3.SA 2014-01-02  8036139  8036139         1  8036139         1
#   2: ABEV3.SA 2014-01-03 24922793 16479466         2 16479466         2
#   3: ABEV3.SA 2014-01-06  9355961 14104964         3 14104964         3
#   4: ABEV3.SA 2014-01-07 18755025 15267480         4 15267480         4
#   5: ABEV3.SA 2014-01-08 11446953 14503374         5 14503374         5
#  ---                                                                   
# 492: BBAS3.SA 2014-12-22  3222300 25233937        21 16780737        65
# 493: BBAS3.SA 2014-12-23  3234100 24233945        22 16626387        65
# 494: BBAS3.SA 2014-12-26  1553400 24653742        21 16681359        63
# 495: BBAS3.SA 2014-12-29  1984000 26420778        19 16543630        63
# 496: BBAS3.SA 2014-12-30  2800100 25239744        20 16368576        63

We can see that the first observation has 1 row feeding it, and the last (of the second ticker) row has 20 rows feeding it. This highlights that the sample data provided does not have the entire month.

(I suggest that the mean without context of n might be misleading, so perhaps this has meaning.)

One note: in general, range-joins in data.table are done as efficiently, but if done sloppily they can explode the memory usage. For instance, if instead of months you chose years, the self-joins would be closer to a ticker-based cartesian join. With smaller datasets even this might not be a problem. With many more tickers and much larger data, and with infinite memory and time, this problem could be reduced to a simple (non-range) cartesian join on ticker, filter out the appropriate dates (based on past.date, ref.date, and present), and then summarize.

Mean of last year (halfyear, month) of observations using.data.table

Answers (2)

Related Questions