Reputation: 1573
I would like to calculate average volume for all tickers in the dataset for previous 12, 6 and 3 months from daily data. Here is the sample data:
library(BatchGetSymbols)
sample <- BatchGetSymbols(tickers = c('AAPL', 'AMZN'), first.date = Sys.Date() - 500)
sample <- sample$df.tickers
sample <- sample[, c('ticker', 'ref.date', 'volume')]
For example, if the date is 2020-05-15, I would like to calculate 3-month average volume (from 2020-02-15 to 2020-05-15). The important thing is that daily data is irregular so maybe first date is not 2020-02-15 but , for example 2020-02-14.
Upvotes: 1
Views: 46
Reputation: 161110
Here's an approach that uses the rolling-time concept.
I don't happen to have that package installed, so I'll use the ExampleData.rds
as provided in the package (which I downloaded from https://github.com/cran/BatchGetSymbols/tree/master/inst/extdata). I'll subset it to just the first two tickers. (I do this for simplicity/demonstration, not because it is required. I also ran this code on the full dataset with all 15 tickers and it took less than 0.03 seconds for the two past periods.)
The data includes just 2014 data, so I'll also reduce your periods to 1 and 3 months, realizing that you can add an arbitrary number of periods to include.
SAMP <- readRDS("ExampleData.rds")
library(data.table)
setDT(SAMP)
unique(SAMP$ticker)[1:2]
SAMP2 <- SAMP[ticker %in% unique(ticker)[1:2], .(ticker, ref.date, volume)]
SAMP2
# ticker ref.date volume
# 1: ABEV3.SA 2014-01-02 8036139
# 2: ABEV3.SA 2014-01-03 24922793
# 3: ABEV3.SA 2014-01-06 9355961
# 4: ABEV3.SA 2014-01-07 18755025
# 5: ABEV3.SA 2014-01-08 11446953
# ---
# 492: BBAS3.SA 2014-12-22 3222300
# 493: BBAS3.SA 2014-12-23 3234100
# 494: BBAS3.SA 2014-12-26 1553400
# 495: BBAS3.SA 2014-12-29 1984000
# 496: BBAS3.SA 2014-12-30 2800100
I use magrittr
just to break things out as a pipe, it is not required. I also add lubridate
in order to use add_with_rollback
(aka %m-
) in order to smartly look back n
months without day-subtractions creating odd effects.
library(magrittr) # %>%
library(lubridate) # %m-% or add_with_rollback, months
past <- c(1, 3) # change this to your c(3, 6, 12)
names(past) <- paste0("months", past)
This operates on a self-join using non-equi (or range) joins. Due to some nuances with how that is done (and the state of column names), my technique is to copy one of the reference columns (ref.date
); after a range-join, the LHS column of a range has its name retained but the values assigned from the RHS. Regardless of the coding intent of that under the hood, I find it can be confusing, so I take a short-term performance hit on duplicating one column and removing it later. Since it does not affect the other columns, it does not detract much from the overall efficiency of using data.table
(and its referential semantics).
I also create a calculated column (for each period), called past.date
. I then use the logic of past.date <= ref.date <= present
for the join (if only data.table
could infer correctly from that more-succinct depiction :-).
SAMP2[, present := ref.date]
newdats <- Map(function(nm, P) {
SAMP2[, past.date := add_with_rollback(ref.date, months(-P)) ] %>%
SAMP2[., on = .(ticker == ticker, past.date <= ref.date, present >= ref.date) ] %>%
.[, setNames(.(mean(i.volume)), nm), by = .(ticker, ref.date) ]
}, names(past), past)
SAMP2[, c("present", "past.date") := NULL ] # clean up extra columns
out <- Reduce(function(a,b) merge(a, b, by = c("ticker", "ref.date"), all.x = TRUE), newdats, init = SAMP2)
out
# ticker ref.date volume months1 months3
# 1: ABEV3.SA 2014-01-02 8036139 8036139 8036139
# 2: ABEV3.SA 2014-01-03 24922793 16479466 16479466
# 3: ABEV3.SA 2014-01-06 9355961 14104964 14104964
# 4: ABEV3.SA 2014-01-07 18755025 15267480 15267480
# 5: ABEV3.SA 2014-01-08 11446953 14503374 14503374
# ---
# 492: BBAS3.SA 2014-12-22 3222300 25233937 16780737
# 493: BBAS3.SA 2014-12-23 3234100 24233945 16626387
# 494: BBAS3.SA 2014-12-26 1553400 24653742 16681359
# 495: BBAS3.SA 2014-12-29 1984000 26420778 16543630
# 496: BBAS3.SA 2014-12-30 2800100 25239744 16368576
To prove that that's doing what we expect, I'll demonstrate two things:
I'll interrupt the Map
on the first iteration, looking at the data immediately after the self-join (SAMP2[., on=.(...)]
) but before the summarization. (I'll also order
it to highlight what is present. This step is not required for proper calc/join.)
SAMP2[, past.date := add_with_rollback(ref.date, months(-P)) ]
# ticker ref.date volume present past.date
# 1: ABEV3.SA 2014-01-02 8036139 2014-01-02 2013-12-02
# 2: ABEV3.SA 2014-01-03 24922793 2014-01-03 2013-12-03
# 3: ABEV3.SA 2014-01-06 9355961 2014-01-06 2013-12-06
# 4: ABEV3.SA 2014-01-07 18755025 2014-01-07 2013-12-07
# 5: ABEV3.SA 2014-01-08 11446953 2014-01-08 2013-12-08
# ---
# 492: BBAS3.SA 2014-12-22 3222300 2014-12-22 2014-11-22
# 493: BBAS3.SA 2014-12-23 3234100 2014-12-23 2014-11-23
# 494: BBAS3.SA 2014-12-26 1553400 2014-12-26 2014-11-26
# 495: BBAS3.SA 2014-12-29 1984000 2014-12-29 2014-11-29
# 496: BBAS3.SA 2014-12-30 2800100 2014-12-30 2014-11-30
SAMP2[, past.date := add_with_rollback(ref.date, months(-P)) ] %>%
SAMP2[., on = .(ticker == ticker, past.date <= ref.date, present >= ref.date) ] %>%
.[ order(ticker, ref.date), ]
# ticker ref.date volume present past.date i.volume i.present i.past.date
# 1: ABEV3.SA 2014-01-02 8036139 2014-01-02 2014-01-02 8036139 2014-01-02 2013-12-02
# 2: ABEV3.SA 2014-01-03 24922793 2014-01-02 2014-01-02 8036139 2014-01-02 2013-12-02
# 3: ABEV3.SA 2014-01-03 24922793 2014-01-03 2014-01-03 24922793 2014-01-03 2013-12-03
# 4: ABEV3.SA 2014-01-06 9355961 2014-01-02 2014-01-02 8036139 2014-01-02 2013-12-02
# 5: ABEV3.SA 2014-01-06 9355961 2014-01-03 2014-01-03 24922793 2014-01-03 2013-12-03
# ---
# 10326: BBAS3.SA 2014-12-30 2800100 2014-12-22 2014-12-22 3222300 2014-12-22 2014-11-22
# 10327: BBAS3.SA 2014-12-30 2800100 2014-12-23 2014-12-23 3234100 2014-12-23 2014-11-23
# 10328: BBAS3.SA 2014-12-30 2800100 2014-12-26 2014-12-26 1553400 2014-12-26 2014-11-26
# 10329: BBAS3.SA 2014-12-30 2800100 2014-12-29 2014-12-29 1984000 2014-12-29 2014-11-29
# 10330: BBAS3.SA 2014-12-30 2800100 2014-12-30 2014-12-30 2800100 2014-12-30 2014-11-30
Note that the first ref.date
(2014-01-02
) has 1 occurrence (no surprise, no data before 2014 in this set), 2014-01-03
has two rows (02
and 03
), etc.
Breaking out of that, I'll change the flow to add the length of data used for each aggregation.
newdats <- Map(function(nm, P) {
SAMP2[, past.date := add_with_rollback(ref.date, months(-P)) ] %>%
SAMP2[., on = .(ticker == ticker, past.date <= ref.date, present >= ref.date) ] %>%
.[, setNames(.(mean(i.volume), .N), c(nm, paste0(nm, "_n"))), by = .(ticker, ref.date) ]
}, names(past), past)
SAMP2[, c("present", "past.date") := NULL ]
out <- Reduce(function(a,b) merge(a, b, by = c("ticker", "ref.date"), all.x = TRUE), newdats, init = SAMP2)
out
# ticker ref.date volume months1 months1_n months3 months3_n
# 1: ABEV3.SA 2014-01-02 8036139 8036139 1 8036139 1
# 2: ABEV3.SA 2014-01-03 24922793 16479466 2 16479466 2
# 3: ABEV3.SA 2014-01-06 9355961 14104964 3 14104964 3
# 4: ABEV3.SA 2014-01-07 18755025 15267480 4 15267480 4
# 5: ABEV3.SA 2014-01-08 11446953 14503374 5 14503374 5
# ---
# 492: BBAS3.SA 2014-12-22 3222300 25233937 21 16780737 65
# 493: BBAS3.SA 2014-12-23 3234100 24233945 22 16626387 65
# 494: BBAS3.SA 2014-12-26 1553400 24653742 21 16681359 63
# 495: BBAS3.SA 2014-12-29 1984000 26420778 19 16543630 63
# 496: BBAS3.SA 2014-12-30 2800100 25239744 20 16368576 63
We can see that the first observation has 1 row feeding it, and the last (of the second ticker) row has 20 rows feeding it. This highlights that the sample data provided does not have the entire month.
(I suggest that the mean
without context of n
might be misleading, so perhaps this has meaning.)
One note: in general, range-joins in data.table
are done as efficiently, but if done sloppily they can explode the memory usage. For instance, if instead of months you chose years, the self-joins would be closer to a ticker-based cartesian join. With smaller datasets even this might not be a problem. With many more tickers and much larger data, and with infinite memory and time, this problem could be reduced to a simple (non-range) cartesian join on ticker
, filter out the appropriate dates (based on past.date
, ref.date
, and present
), and then summarize.
Upvotes: 1
Reputation: 24888
Here's an approach with dpyr
and purrr
:
library(dplyr)
library(purrr)
sample %>%
group_by(ticker) %>%
summarise(map_dfr(setNames(c(3,6,12),c("3month","6month","12month")),
~ mean(volume[ref.date > (Sys.Date() - 30 * .x)])))
ticker `3month` `6month` `12month`
<chr> <dbl> <dbl> <dbl>
1 AAPL 114310505 144409320. 152001362.
2 AMZN 4421073. 4711179. 4966227.
Or with data.table
:
library(data.table)
setDT(sample)
sample[,lapply(setNames(c(3,6,12),c("3month","6month","12month")),
function(x)mean(volume[ref.date > (Sys.Date() - 30 * x)])),by = ticker]
ticker 3month 6month 12month
1: AAPL 114310505 144409320 152001362
2: AMZN 4421073 4711179 4966227
Upvotes: 1