Jordan Wrong
Jordan Wrong

Reputation: 1245

Rolling Mean By Group Dplyr/data.table

I have a large data.frame with 350k rows. Currently I am trying to create a rolling mean of the lag_close column in the data.table. Since I am more fluent in dplyr than in data.table I tried to create a rolling mean using dplyr which shot me an error.

df = df %>% na.omit()%>%  group_by(ticker) %>% mutate(avg10 = rollapplyr(lag_close, 10, mean))

Error: Column avg10 must be length 577 (the group size) or one, not 567

My next try was to go the data.table route as I saw people having success. No error is returned, but the output is very strange and not accurate.

df = as.data.table(df)
df[, laggedRets := rollapplyr(lag_close, 10, mean, na.rm = T), by = ticker]
tail(df, 10)
ticker ret.adjusted.prices ret.closing.prices lag_close         jump laggedRets
 1:   MELI         0.002731904        0.002731904    501.48  0.020174791    507.849
 2:   MELI         0.007258614        0.007258614    502.85  0.001966823    507.528
 3:   MELI        -0.002606134       -0.002606134    506.50 -0.011437339    507.085
 4:   MELI         0.018132218        0.018132218    505.18  0.003832865    506.282
 5:   MELI        -0.020628408       -0.020628408    514.34 -0.014373574    505.262
 6:   MELI         0.001568257        0.001568257    503.73  0.003428456         NA
 7:   MELI         0.018155879        0.018155879    504.52 -0.002957652         NA
 8:   MELI        -0.022815765       -0.022815765    513.68  0.002663470         NA
 9:   MELI        -0.007132015       -0.007132015    501.96 -0.005273231         NA
10:   MELI        -0.012801487       -0.012801487    498.38 -0.016467802         NA

head(df, 10)
  ticker ret.adjusted.prices ret.closing.prices lag_close         jump laggedRets
 1:     AA        0.0496011100       0.0496011100     28.83  0.006568736         NA
 2:     AA        0.0128883014       0.0128883014     30.26 -0.002647255         NA
 3:     AA        0.0009787928       0.0009787928     30.65  0.001304239         NA
 4:     AA       -0.0391134289      -0.0391134289     30.68  0.006497716         NA
 5:     AA        0.0508819539       0.0508819539     29.48  0.011802531     31.154
 6:     AA        0.0319560684       0.0319560684     30.98  0.003222691     31.752
 7:     AA        0.0334689407       0.0334689407     31.97  0.031709630     32.268
 8:     AA       -0.0009080811      -0.0009080811     33.04 -0.006680893     32.769
 9:     AA       -0.0112086950      -0.0112086950     33.01 -0.001212398     33.330
10:     AA        0.0664829064       0.0664829064     32.64  0.009452792     34.132

It seems like there are NA's on either side of the data.table, that should not be the case. Also the numbers are not accurate. I am not to sure what I am missing. Here is a sample so you guys can test it out.

df =structure(list(ticker = c("AA", "AA", "AA", "AA", "AA", "AA", 
"AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", 
"AA", "AA", "AA", "AABA", "AABA", "AABA", "AABA", "AABA", "AABA", 
"AABA", "AABA", "AABA", "AABA", "AABA", "AABA", "AABA", "AABA", 
"AABA", "AABA", "AABA", "AABA", "AABA", "AABA", "AAL", "AAL", 
"AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", 
"AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", 
"AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", 
"AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", 
"AAOI", "AAOI", "AAOI", "AAOI", "AAPL", "AAPL", "AAPL", "AAPL", 
"AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", 
"AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", 
"ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", 
"ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", 
"ABBV", "ABBV", "ABBV", "ABBV"), lag_close = c(28.83, 30.26, 
30.65, 30.68, 29.48, 30.98, 31.969999, 33.040001, 33.009998, 
32.639999, 34.810001, 35.419998, 35.66, 36.290001, 37.5, 36.529999, 
36.299999, 36.669998, 36.549999, 36.450001, 38.900002, 40.060001, 
41.34, 41.23, 41.34, 42.299999, 42.59, 42.110001, 42.27, 41.990002, 
42.029999, 42.09, 42.049999, 42.400002, 43.900002, 44.939999, 
44.549999, 44.419998, 43.93, 44.07, 46.299999, 46.700001, 45.889999, 
46.209999, 47.080002, 48.48, 48.639999, 48.099998, 47.650002, 
46.75, 47.639999, 47.259998, 48, 46.939999, 47.540001, 47.91, 
49.59, 46.950001, 44.900002, 44.25, 23.1, 23.51, 22.809999, 22.42, 
22.24, 22.870001, 22.719999, 29.860001, 28.219999, 30.440001, 
30.84, 30.48, 30.58, 31.290001, 32, 31.709999, 31.280001, 31.15, 
30.41, 30.76, 116.150002, 116.019997, 116.610001, 117.910004, 
118.989998, 119.110001, 119.75, 119.25, 119.040001, 120, 119.989998, 
119.779999, 120, 120.080002, 119.970001, 121.879997, 121.940002, 
121.949997, 121.629997, 121.349998, 62.41, 63.290001, 63.77, 
63.790001, 64.209999, 64.07, 61.139999, 61.279999, 61.990002, 
61.860001, 61.66, 61.380001, 61.150002, 60.959999, 60.580002, 
61.380001, 61.27, 60, 60.43, 61.110001)), row.names = c(NA, -120L
), class = "data.frame", .Names = c("ticker", "lag_close"))

Upvotes: 1

Views: 840

Answers (3)

NicChr
NicChr

Reputation: 1253

I've recently figured out a way you can perform 1 single rolling mean calculation even for an arbitrary number of groups using data.table.

The frollmean has an adaptive argument that lets you supply a vector of window sizes, (provided you align "right").

This means if your data are sorted by the groups, you simply need to give it the correct window vector.

setorder(dt, ticker)
group_sizes <- dt[, .N, by = ticker]$N
window_size <- 10L
window <- pmin(sequence(group_sizes), window_size)
window[window < window_size] <- NA_integer_

library(bench)
mark(e1 = dt[, laggedRets := frollmean(lag_close, 10, na.rm = TRUE), by = ticker]$laggedRets,
     e2 = dt[, laggedRets := frollmean(lag_close, n = window, na.rm = TRUE, 
                                       adaptive = TRUE)]$laggedRets)
# A tibble: 2 x 13
  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time      
  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>    
1 e1            512us  601us     1471.   131.1KB     4.35   676     2      460ms <dbl>  <Rprofmem> <bench_tm>
2 e2            308us  378us     2411.    50.6KB     4.37  1103     2      458ms <dbl>  <Rprofmem> <bench_tm>

I have a function roll_mean() in my package timeplyr which does this and accepts unsorted data too.

Link: https://github.com/NicChr/timeplyr

Benchmark with lots of groups

# remotes::install_github("NicChr/timeplyr")
library(timeplyr)
dt <- data.table(g = sample.int(1e05, 1e06, TRUE),
                 x = rnorm(10^6))
mark(e1 = dt[, mean := roll_mean(x, n = 7, g = get("g"), partial = FALSE, na.rm = FALSE)]$mean,
     e2 = dt[, mean := frollmean(x, n = 7, align = "right", na.rm = FALSE), by = "g"]$mean)
# A tibble: 2 x 13
  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time      
  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>    
1 e1         57.28ms 73.11ms    14.0     58.75MB     0        7     0   500.99ms <dbl>  <Rprofmem> <bench_tm>
2 e2           4.04s   4.04s     0.248    1.54GB     1.98     1     8      4.04s <dbl>  <Rprofmem> <bench_tm>

We can see that setting up the correct window vector results in a ~60x speedup and a ~25x memory reduction.

Upvotes: 0

Cettt
Cettt

Reputation: 11981

you could use map from the purrr package and apply it on 1:n():

df = df %>% 
  na.omit() %>% 
  group_by(ticker) %>% 
  mutate(avg10 = map_dbl(1:n(), ~mean(lag_close[(max(.x-9, 1)):.x], na.rm =T))

Of course you have to decide what should happen with the first 9 rows where there are fewer than 10 observations. In my solution the rows 1 to 9 contain the mean of the last 1 to 9 observations.

Upvotes: 1

Cole
Cole

Reputation: 11255

Here's the data.table() approach with rows 1-9 being NA.

dt <- as.data.table(df)
dt[, laggedRets := frollmean(lag_close, 10, na.rm = T), by = ticker]

Also, it was unclear whether the NA should be removed before the rolling mean. If that's the case, you would use:

dt <- as.data.table(df)
dt[!is.na(lag_close), laggedRets := frollmean(lag_close, 10), by = ticker]

Upvotes: 3

Related Questions