Reputation: 1245
I have a large data.frame with 350k rows. I am trying to create a rolling mean of the lag_close column. Since I am more fluent in dplyr than in data.table, I first tried to create the rolling mean using dplyr, which threw an error:
df = df %>% na.omit() %>% group_by(ticker) %>% mutate(avg10 = rollapplyr(lag_close, 10, mean))
Error: Column `avg10` must be length 577 (the group size) or one, not 567
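(For what it's worth, the error arises because rollapplyr() with right alignment returns a vector shorter than the group unless you tell it how to pad, and mutate() refuses to recycle a 567-long result into a 577-row group. A minimal sketch of the usual fix, passing fill = NA so the result keeps the input length; assumes the zoo package is installed:

```r
library(zoo)  # rollapplyr()

x <- 1:12
length(rollapplyr(x, 10, mean))          # 3 -- too short for mutate()
r <- rollapplyr(x, 10, mean, fill = NA)  # pad incomplete windows with NA
length(r)                                # 12 -- first 9 entries are NA
r[10]                                    # 5.5, i.e. mean(1:10)
```

In the pipeline above that would be mutate(avg10 = rollapplyr(lag_close, 10, mean, fill = NA)).)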
My next try was the data.table route, since I had seen others succeed with it. No error is returned, but the output is strange and not accurate.
df = as.data.table(df)
df[, laggedRets := rollapplyr(lag_close, 10, mean, na.rm = T), by = ticker]
tail(df, 10)
ticker ret.adjusted.prices ret.closing.prices lag_close jump laggedRets
1: MELI 0.002731904 0.002731904 501.48 0.020174791 507.849
2: MELI 0.007258614 0.007258614 502.85 0.001966823 507.528
3: MELI -0.002606134 -0.002606134 506.50 -0.011437339 507.085
4: MELI 0.018132218 0.018132218 505.18 0.003832865 506.282
5: MELI -0.020628408 -0.020628408 514.34 -0.014373574 505.262
6: MELI 0.001568257 0.001568257 503.73 0.003428456 NA
7: MELI 0.018155879 0.018155879 504.52 -0.002957652 NA
8: MELI -0.022815765 -0.022815765 513.68 0.002663470 NA
9: MELI -0.007132015 -0.007132015 501.96 -0.005273231 NA
10: MELI -0.012801487 -0.012801487 498.38 -0.016467802 NA
head(df, 10)
ticker ret.adjusted.prices ret.closing.prices lag_close jump laggedRets
1: AA 0.0496011100 0.0496011100 28.83 0.006568736 NA
2: AA 0.0128883014 0.0128883014 30.26 -0.002647255 NA
3: AA 0.0009787928 0.0009787928 30.65 0.001304239 NA
4: AA -0.0391134289 -0.0391134289 30.68 0.006497716 NA
5: AA 0.0508819539 0.0508819539 29.48 0.011802531 31.154
6: AA 0.0319560684 0.0319560684 30.98 0.003222691 31.752
7: AA 0.0334689407 0.0334689407 31.97 0.031709630 32.268
8: AA -0.0009080811 -0.0009080811 33.04 -0.006680893 32.769
9: AA -0.0112086950 -0.0112086950 33.01 -0.001212398 33.330
10: AA 0.0664829064 0.0664829064 32.64 0.009452792 34.132
It seems like there are NA's on either side of the data.table, which should not be the case, and the numbers are not accurate. I am not too sure what I am missing. Here is a sample so you can test it out:
df =structure(list(ticker = c("AA", "AA", "AA", "AA", "AA", "AA",
"AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA",
"AA", "AA", "AA", "AABA", "AABA", "AABA", "AABA", "AABA", "AABA",
"AABA", "AABA", "AABA", "AABA", "AABA", "AABA", "AABA", "AABA",
"AABA", "AABA", "AABA", "AABA", "AABA", "AABA", "AAL", "AAL",
"AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL",
"AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL",
"AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI",
"AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI",
"AAOI", "AAOI", "AAOI", "AAOI", "AAPL", "AAPL", "AAPL", "AAPL",
"AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL",
"AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL",
"ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV",
"ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV",
"ABBV", "ABBV", "ABBV", "ABBV"), lag_close = c(28.83, 30.26,
30.65, 30.68, 29.48, 30.98, 31.969999, 33.040001, 33.009998,
32.639999, 34.810001, 35.419998, 35.66, 36.290001, 37.5, 36.529999,
36.299999, 36.669998, 36.549999, 36.450001, 38.900002, 40.060001,
41.34, 41.23, 41.34, 42.299999, 42.59, 42.110001, 42.27, 41.990002,
42.029999, 42.09, 42.049999, 42.400002, 43.900002, 44.939999,
44.549999, 44.419998, 43.93, 44.07, 46.299999, 46.700001, 45.889999,
46.209999, 47.080002, 48.48, 48.639999, 48.099998, 47.650002,
46.75, 47.639999, 47.259998, 48, 46.939999, 47.540001, 47.91,
49.59, 46.950001, 44.900002, 44.25, 23.1, 23.51, 22.809999, 22.42,
22.24, 22.870001, 22.719999, 29.860001, 28.219999, 30.440001,
30.84, 30.48, 30.58, 31.290001, 32, 31.709999, 31.280001, 31.15,
30.41, 30.76, 116.150002, 116.019997, 116.610001, 117.910004,
118.989998, 119.110001, 119.75, 119.25, 119.040001, 120, 119.989998,
119.779999, 120, 120.080002, 119.970001, 121.879997, 121.940002,
121.949997, 121.629997, 121.349998, 62.41, 63.290001, 63.77,
63.790001, 64.209999, 64.07, 61.139999, 61.279999, 61.990002,
61.860001, 61.66, 61.380001, 61.150002, 60.959999, 60.580002,
61.380001, 61.27, 60, 60.43, 61.110001)), row.names = c(NA, -120L
), class = "data.frame", .Names = c("ticker", "lag_close"))
Upvotes: 1
Views: 840
Reputation: 1253
I've recently figured out a way to perform a single rolling mean calculation for an arbitrary number of groups using data.table. frollmean() has an adaptive argument that lets you supply a vector of window sizes (provided you use align = "right"). This means that if your data are sorted by group, you simply need to give it the correct window vector.
setorder(dt, ticker)                                  # adaptive windows assume sorted groups
group_sizes <- dt[, .N, by = ticker]$N
window_size <- 10L
window <- pmin(sequence(group_sizes), window_size)    # 1, 2, ..., 10, 10, ... per group
window[window < window_size] <- NA_integer_           # partial windows -> NA
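As a quick sanity check on the window construction (toy group sizes, not the question's data): for two sorted groups of sizes 3 and 5 with a window of 3, the vector restarts with NAs at each group boundary, so the adaptive roll never averages across groups:

```r
group_sizes <- c(3L, 5L)  # hypothetical sizes for two sorted groups
window_size <- 3L
window <- pmin(sequence(group_sizes), window_size)  # 1 2 3 1 2 3 3 3
window[window < window_size] <- NA_integer_
window  # NA NA 3 NA NA 3 3 3
```

An NA window size makes frollmean() emit NA for that row, which reproduces the incomplete-window NAs of the grouped version.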
library(bench)
mark(e1 = dt[, laggedRets := frollmean(lag_close, 10, na.rm = TRUE), by = ticker]$laggedRets,
e2 = dt[, laggedRets := frollmean(lag_close, n = window, na.rm = TRUE,
adaptive = TRUE)]$laggedRets)
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time
<bch:expr> <bch:tm> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list>
1 e1 512us 601us 1471. 131.1KB 4.35 676 2 460ms <dbl> <Rprofmem> <bench_tm>
2 e2 308us 378us 2411. 50.6KB 4.37 1103 2 458ms <dbl> <Rprofmem> <bench_tm>
My package timeplyr has a function roll_mean() that does this and also accepts unsorted data. Link: https://github.com/NicChr/timeplyr
# remotes::install_github("NicChr/timeplyr")
library(timeplyr)
dt <- data.table(g = sample.int(1e05, 1e06, TRUE),
x = rnorm(10^6))
mark(e1 = dt[, mean := roll_mean(x, n = 7, g = get("g"), partial = FALSE, na.rm = FALSE)]$mean,
e2 = dt[, mean := frollmean(x, n = 7, align = "right", na.rm = FALSE), by = "g"]$mean)
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time
<bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list>
1 e1 57.28ms 73.11ms 14.0 58.75MB 0 7 0 500.99ms <dbl> <Rprofmem> <bench_tm>
2 e2 4.04s 4.04s 0.248 1.54GB 1.98 1 8 4.04s <dbl> <Rprofmem> <bench_tm>
We can see that setting up the correct window vector results in a ~60x speedup and a ~25x memory reduction.
Upvotes: 0
Reputation: 11981
You could use map_dbl() from the purrr package and apply it over 1:n():
df = df %>%
  na.omit() %>%
  group_by(ticker) %>%
  mutate(avg10 = map_dbl(1:n(), ~ mean(lag_close[max(.x - 9, 1):.x], na.rm = TRUE)))
Of course you have to decide what should happen in the first 9 rows of each group, where there are fewer than 10 observations. In this solution, rows 1 to 9 contain the mean of the 1 to 9 observations available so far.
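If you would rather have NA in those first 9 rows instead of partial means, one variant is to blank out the incomplete windows afterwards; sketched here on toy data (not the question's sample):

```r
library(dplyr)
library(purrr)

# toy data: two tickers, 12 rows each
df <- tibble(ticker = rep(c("AA", "BB"), each = 12),
             lag_close = as.numeric(1:24))

out <- df %>%
  group_by(ticker) %>%
  mutate(avg10 = map_dbl(1:n(), ~ mean(lag_close[max(.x - 9, 1):.x])),
         avg10 = if_else(row_number() < 10, NA_real_, avg10)) %>%  # partial windows -> NA
  ungroup()

out$avg10[10]  # 5.5 = mean(1:10), first complete window for ticker AA
```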
Upvotes: 1
Reputation: 11255
Here's the data.table approach, with rows 1-9 of each group being NA.
dt <- as.data.table(df)
dt[, laggedRets := frollmean(lag_close, 10, na.rm = T), by = ticker]
Also, it was unclear whether the NA values should be removed before computing the rolling mean. If that's the case, you would use:
dt <- as.data.table(df)
dt[!is.na(lag_close), laggedRets := frollmean(lag_close, 10), by = ticker]
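One subtlety of the filtered := worth noting (illustrated on toy data, not the question's sample): the rolling mean is computed over the non-NA values only, while the filtered-out rows keep laggedRets = NA, so row alignment with the original table is preserved:

```r
library(data.table)

# 16 rows, one NA in the middle
dt <- data.table(ticker = "AA", lag_close = c(1:5, NA, 6:15))
dt[!is.na(lag_close), laggedRets := frollmean(lag_close, 10)]

dt$laggedRets[6]   # NA: the filtered-out row is untouched
dt$laggedRets[11]  # 5.5: mean of the first 10 non-NA values, 1:10
```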
Upvotes: 3