Ayoze Alfageme
Ayoze Alfageme

Reputation: 311

Average Annual Growth Rate with moving length R

Given the structure of my data below, I would like to calculate the forward five years average annual growth rate (where the first year is lead(markup) of the column markup for each individual (individual are identified in the column (gvkey)), and add to the data frame that average as a column. However, some individuals have less than five years observations, and for all individuals', their last 4 years observations, have less than 5 years of observations ahead of them. For those cases, the average annual growth rate should adjust to the number of observation ahead of them (with a maximum of 5).

dput(example)
structure(list(gvkey = c(1001L, 1001L, 1001L, 1003L, 1003L, 1003L, 
1003L, 1003L, 1003L, 1003L, 1004L, 1004L, 1004L, 1004L, 1004L, 
1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 
1004L, 1004L, 1004L, 1004L, 1004L, 1004L), fyear = c(1983L, 1984L, 
1985L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 1989L, 1980L, 
1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 1989L, 
1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 
1999L), markup = c(3.02456418383518, 2.91714600416106, 2.97620103473762, 
0.628645648836935, 0.538264738598443, 0.74536402337831, 0.89905329776662, 
0.571759161863088, 0.510497237569061, 0.621391904401246, 0.320146680750145, 
0.277978758953348, 0.31442332968701, 0.319433516915814, 0.324865816687745, 
0.335264348013352, 0.328048313395744, 0.326632245360565, 0.340874293859881, 
0.320374201245953, 0.27456562124358, 0.276693369097675, 0.245072145096866, 
0.241026046834387, 0.242841330851661, 0.249635000371186, 0.257903948772679, 
0.262641379065405, 0.261534064206543, 0.22953354130982)), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -30L), groups = structure(list(
    gvkey = c(1001L, 1003L, 1004L), .rows = structure(list(1:3, 
        4:10, 11:30), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L), .drop = TRUE))

This is as far as I got:

example %>%
  filter(fyear %in% 1980:2019)%>%
  group_by((gvkey))%>%
  mutate(markupchange = ((((lead(markup)-markup)/markup)+(lead(markup, K =2)-lead(markup)/lead(markup))+(lead(markup, K =3)-lead(markup, k =2)/lead(markup, K=2))+(lead(markup, K =4)-lead(markup, k =3)/lead(markup, K=3))+(lead(markup, K =5)-lead(markup, k =4)/lead(markup, K=4))/5)))

What I can't figure out how to indicate is to shorten the lenght of the average annual growth rate for thoses cases with less than 5 observations ahead.

As an ouput I would like to get back the same data frame with the extra column for the average annual growth rate of the markup. The value in row 1 of the added column should be -0,00628231878798876 and in the second row 0,020547945. Many thanks for any tips.

Upvotes: 1

Views: 184

Answers (2)

NicChr
NicChr

Reputation: 1253

Here's a solution using frollmean() which might be more computationally efficient.

First we need a helper to create rolling partial windows for use in frollmean(). We'll use it to set bounds on the window sizes based on the group sizes.

# Get rolling window sizes, including partial
window_seq <- function(k, n, partial = TRUE){
  if (length(k) != 1L) stop("k must be of length 1.")
  if (length(n) != 1L) stop("n must be of length 1.")
  if (n > .Machine[["integer.max"]]){
    stop("n must not be greater than .Machine$integer.max")
  }
  n <- as.integer(n)
  k <- as.integer(k)
  k <- min(k, n) # Bound k to <= n
  k <- max(k, 0L) # Bound k to >= 0
  pk <- max(k - 1L, 0L) # Partial k, bounded to >= 0
  p_seq <- seq_len(pk) # Partial window sequence
  out <- rep_len(k, n)
  # Replace partial part with partial sequence
  if (partial){
    out[p_seq] <- p_seq
  } else {
    out[p_seq] <- NA_integer_
  }
  out
}

data.table

library(data.table)

# data.table
example2 <- copy(example)
setDT(example2)

example2[, pchange := (markup/shift(markup)) - 1,
         by = gvkey]
example2[, row_id := rowid(gvkey)]
setorderv(example2, c("gvkey", "row_id"), order = c(1L, -1L))
example2[, avg5 := frollmean(shift(pchange), n = window_seq(5, .N),
                             adaptive = T, na.rm = T),
         by = gvkey]
setorderv(example2, c("gvkey", "row_id"))
print(example2, n = 10)
#>     gvkey fyear    markup      pchange row_id         avg5
#>  1:  1001  1983 3.0245642           NA      1 -0.007635573
#>  2:  1001  1984 2.9171460 -0.035515259      2  0.020244112
#>  3:  1001  1985 2.9762010  0.020244112      3          NaN
#>  4:  1003  1983 0.6286456           NA      1 -0.004802629
#>  5:  1003  1984 0.5382647 -0.143770836      2  0.067397285
#> ---                                                       
#> 26:  1004  1995 0.2496350  0.027975755     16 -0.018769986
#> 27:  1004  1996 0.2579039  0.033124155     17 -0.036068033
#> 28:  1004  1997 0.2626414  0.018368972     18 -0.063286535
#> 29:  1004  1998 0.2615341 -0.004216072     19 -0.122356998
#> 30:  1004  1999 0.2295335 -0.122356998     20          NaN

dplyr

library(dplyr)
example %>%
  group_by(gvkey) %>%
  mutate(pchange = (markup/lag(markup)) - 1) %>%
  mutate(row_id = row_number()) %>%
  arrange(desc(row_id), .by_group = TRUE) %>%
  mutate(avg5 = frollmean(lag(pchange), n = window_seq(5, n()), 
                          adaptive = T, na.rm = T)) %>%
  arrange(row_id, .by_group = TRUE)
#> # A tibble: 30 x 6
#> # Groups:   gvkey [3]
#>    gvkey fyear markup pchange row_id      avg5
#>    <int> <int>  <dbl>   <dbl>  <int>     <dbl>
#>  1  1001  1983  3.02  NA           1  -0.00764
#>  2  1001  1984  2.92  -0.0355      2   0.0202 
#>  3  1001  1985  2.98   0.0202      3 NaN      
#>  4  1003  1983  0.629 NA           1  -0.00480
#>  5  1003  1984  0.538 -0.144       2   0.0674 
#>  6  1003  1985  0.745  0.385       3  -0.0119 
#>  7  1003  1986  0.899  0.206       4  -0.0847 
#>  8  1003  1987  0.572 -0.364       5   0.0550 
#>  9  1003  1988  0.510 -0.107       6   0.217  
#> 10  1003  1989  0.621  0.217       7 NaN      
#> # i 20 more rows

For actual "growth rates", one can use geometric mean of percent changes which gives you the expected percent change per unit time.

# Growth rates
example %>%
  group_by(gvkey) %>%
  mutate(pchange = (markup/lag(markup))) %>%
  mutate(row_id = row_number()) %>%
  arrange(desc(row_id), .by_group = TRUE) %>%
  mutate(growth_rate5 = exp(frollmean(log(lag(pchange)), n = window_seq(5, n()),
                                      adaptive = T, na.rm = T)) - 1) %>%
  arrange(row_id, .by_group = TRUE)
#> # A tibble: 30 x 6
#> # Groups:   gvkey [3]
#>    gvkey fyear markup pchange row_id growth_rate5
#>    <int> <int>  <dbl>   <dbl>  <int>        <dbl>
#>  1  1001  1983  3.02   NA          1     -0.00803
#>  2  1001  1984  2.92    0.964      2      0.0202 
#>  3  1001  1985  2.98    1.02       3    NaN      
#>  4  1003  1983  0.629  NA          1     -0.0408 
#>  5  1003  1984  0.538   0.856      2      0.0291 
#>  6  1003  1985  0.745   1.38       3     -0.0445 
#>  7  1003  1986  0.899   1.21       4     -0.116  
#>  8  1003  1987  0.572   0.636      5      0.0425 
#>  9  1003  1988  0.510   0.893      6      0.217  
#> 10  1003  1989  0.621   1.22       7    NaN      
#> # i 20 more rows

Upvotes: 0

r2evans
r2evans

Reputation: 160417

Since your system is complaining about the size of your data, I'm going to recommend a non-join method (see the answer history for the first cut on this).

  1. It is a bespoke function for each group, so it internally determines which data and the mean;
  2. If necessary (based on memory complaints), I will suggest a data.table variant. I'll also provide the dplyr version that works with this data, but it may not work with your larger data, in which case data.table (with its in-place referential semantics) will be preferred.

First, a quick helper-function:

fun <- function(mkup, fyr, span=5) {
  sapply(fyr, function(yr) {
    val <- mkup[between(fyr, yr, yr+span)]
    mean(c(diff(val), NA) / val, na.rm = TRUE)
  })
}

noting that between can be from either dplyr or data.table, they are equivalent in this use.

### data.table
library(data.table)
EX <- as.data.table(example)
EX[, avg5 := fun(markup, fyear), by = gvkey]
### (same result as below)

### dplyr
example %>%
  group_by(gvkey) %>%
  mutate(avg5 = fun(markup, fyear)) %>%
  ungroup()
# # A tibble: 30 × 4
#    gvkey fyear markup      avg5
#    <int> <int>  <dbl>     <dbl>
#  1  1001  1983  3.02   -0.00764
#  2  1001  1984  2.92    0.0202 
#  3  1001  1985  2.98  NaN      
#  4  1003  1983  0.629  -0.00480
#  5  1003  1984  0.538   0.0674 
#  6  1003  1985  0.745  -0.0119 
#  7  1003  1986  0.899  -0.0847 
#  8  1003  1987  0.572   0.0550 
#  9  1003  1988  0.510   0.217  
# 10  1003  1989  0.621 NaN      
# # … with 20 more rows
# # ℹ Use `print(n = ...)` to see more rows

Upvotes: 2

Related Questions