Reputation: 311
Given the structure of my data below, I would like to calculate the forward five years average annual growth rate (where the first year is lead(markup)
of the column markup
for each individual (individual are identified in the column (gvkey)
), and add to the data frame that average as a column. However, some individuals have less than five years observations, and for all individuals', their last 4 years observations, have less than 5 years of observations ahead of them. For those cases, the average annual growth rate should adjust to the number of observation ahead of them (with a maximum of 5).
dput(example)
structure(list(gvkey = c(1001L, 1001L, 1001L, 1003L, 1003L, 1003L,
1003L, 1003L, 1003L, 1003L, 1004L, 1004L, 1004L, 1004L, 1004L,
1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L,
1004L, 1004L, 1004L, 1004L, 1004L, 1004L), fyear = c(1983L, 1984L,
1985L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 1989L, 1980L,
1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 1989L,
1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L,
1999L), markup = c(3.02456418383518, 2.91714600416106, 2.97620103473762,
0.628645648836935, 0.538264738598443, 0.74536402337831, 0.89905329776662,
0.571759161863088, 0.510497237569061, 0.621391904401246, 0.320146680750145,
0.277978758953348, 0.31442332968701, 0.319433516915814, 0.324865816687745,
0.335264348013352, 0.328048313395744, 0.326632245360565, 0.340874293859881,
0.320374201245953, 0.27456562124358, 0.276693369097675, 0.245072145096866,
0.241026046834387, 0.242841330851661, 0.249635000371186, 0.257903948772679,
0.262641379065405, 0.261534064206543, 0.22953354130982)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -30L), groups = structure(list(
gvkey = c(1001L, 1003L, 1004L), .rows = structure(list(1:3,
4:10, 11:30), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L), .drop = TRUE))
This is as far as I got:
example %>%
filter(fyear %in% 1980:2019)%>%
group_by((gvkey))%>%
mutate(markupchange = ((((lead(markup)-markup)/markup)+(lead(markup, K =2)-lead(markup)/lead(markup))+(lead(markup, K =3)-lead(markup, k =2)/lead(markup, K=2))+(lead(markup, K =4)-lead(markup, k =3)/lead(markup, K=3))+(lead(markup, K =5)-lead(markup, k =4)/lead(markup, K=4))/5)))
What I can't figure out how to indicate is to shorten the lenght of the average annual growth rate for thoses cases with less than 5 observations ahead.
As an ouput I would like to get back the same data frame with the extra column for the average annual growth rate of the markup
.
The value in row 1 of the added column should be -0,00628231878798876 and in the second row 0,020547945.
Many thanks for any tips.
Upvotes: 1
Views: 184
Reputation: 1253
Here's a solution using frollmean()
which might be more computationally efficient.
First we need a helper to create rolling partial windows for use in frollmean()
.
We'll use it to set bounds on the window sizes based on the group sizes.
# Get rolling window sizes, including partial
window_seq <- function(k, n, partial = TRUE){
if (length(k) != 1L) stop("k must be of length 1.")
if (length(n) != 1L) stop("n must be of length 1.")
if (n > .Machine[["integer.max"]]){
stop("n must not be greater than .Machine$integer.max")
}
n <- as.integer(n)
k <- as.integer(k)
k <- min(k, n) # Bound k to <= n
k <- max(k, 0L) # Bound k to >= 0
pk <- max(k - 1L, 0L) # Partial k, bounded to >= 0
p_seq <- seq_len(pk) # Partial window sequence
out <- rep_len(k, n)
# Replace partial part with partial sequence
if (partial){
out[p_seq] <- p_seq
} else {
out[p_seq] <- NA_integer_
}
out
}
library(data.table)
# data.table
example2 <- copy(example)
setDT(example2)
example2[, pchange := (markup/shift(markup)) - 1,
by = gvkey]
example2[, row_id := rowid(gvkey)]
setorderv(example2, c("gvkey", "row_id"), order = c(1L, -1L))
example2[, avg5 := frollmean(shift(pchange), n = window_seq(5, .N),
adaptive = T, na.rm = T),
by = gvkey]
setorderv(example2, c("gvkey", "row_id"))
print(example2, n = 10)
#> gvkey fyear markup pchange row_id avg5
#> 1: 1001 1983 3.0245642 NA 1 -0.007635573
#> 2: 1001 1984 2.9171460 -0.035515259 2 0.020244112
#> 3: 1001 1985 2.9762010 0.020244112 3 NaN
#> 4: 1003 1983 0.6286456 NA 1 -0.004802629
#> 5: 1003 1984 0.5382647 -0.143770836 2 0.067397285
#> ---
#> 26: 1004 1995 0.2496350 0.027975755 16 -0.018769986
#> 27: 1004 1996 0.2579039 0.033124155 17 -0.036068033
#> 28: 1004 1997 0.2626414 0.018368972 18 -0.063286535
#> 29: 1004 1998 0.2615341 -0.004216072 19 -0.122356998
#> 30: 1004 1999 0.2295335 -0.122356998 20 NaN
library(dplyr)
example %>%
group_by(gvkey) %>%
mutate(pchange = (markup/lag(markup)) - 1) %>%
mutate(row_id = row_number()) %>%
arrange(desc(row_id), .by_group = TRUE) %>%
mutate(avg5 = frollmean(lag(pchange), n = window_seq(5, n()),
adaptive = T, na.rm = T)) %>%
arrange(row_id, .by_group = TRUE)
#> # A tibble: 30 x 6
#> # Groups: gvkey [3]
#> gvkey fyear markup pchange row_id avg5
#> <int> <int> <dbl> <dbl> <int> <dbl>
#> 1 1001 1983 3.02 NA 1 -0.00764
#> 2 1001 1984 2.92 -0.0355 2 0.0202
#> 3 1001 1985 2.98 0.0202 3 NaN
#> 4 1003 1983 0.629 NA 1 -0.00480
#> 5 1003 1984 0.538 -0.144 2 0.0674
#> 6 1003 1985 0.745 0.385 3 -0.0119
#> 7 1003 1986 0.899 0.206 4 -0.0847
#> 8 1003 1987 0.572 -0.364 5 0.0550
#> 9 1003 1988 0.510 -0.107 6 0.217
#> 10 1003 1989 0.621 0.217 7 NaN
#> # i 20 more rows
For actual "growth rates", one can use geometric mean of percent changes which gives you the expected percent change per unit time.
# Growth rates
example %>%
group_by(gvkey) %>%
mutate(pchange = (markup/lag(markup))) %>%
mutate(row_id = row_number()) %>%
arrange(desc(row_id), .by_group = TRUE) %>%
mutate(growth_rate5 = exp(frollmean(log(lag(pchange)), n = window_seq(5, n()),
adaptive = T, na.rm = T)) - 1) %>%
arrange(row_id, .by_group = TRUE)
#> # A tibble: 30 x 6
#> # Groups: gvkey [3]
#> gvkey fyear markup pchange row_id growth_rate5
#> <int> <int> <dbl> <dbl> <int> <dbl>
#> 1 1001 1983 3.02 NA 1 -0.00803
#> 2 1001 1984 2.92 0.964 2 0.0202
#> 3 1001 1985 2.98 1.02 3 NaN
#> 4 1003 1983 0.629 NA 1 -0.0408
#> 5 1003 1984 0.538 0.856 2 0.0291
#> 6 1003 1985 0.745 1.38 3 -0.0445
#> 7 1003 1986 0.899 1.21 4 -0.116
#> 8 1003 1987 0.572 0.636 5 0.0425
#> 9 1003 1988 0.510 0.893 6 0.217
#> 10 1003 1989 0.621 1.22 7 NaN
#> # i 20 more rows
Upvotes: 0
Reputation: 160417
Since your system is complaining about the size of your data, I'm going to recommend a non-join method (see the answer history for the first cut on this).
data.table
variant. I'll also provide the dplyr
version that works with this data, but it may not work with your larger data, in which case data.table
(with its in-place referential semantics) will be preferred.First, a quick helper-function:
fun <- function(mkup, fyr, span=5) {
sapply(fyr, function(yr) {
val <- mkup[between(fyr, yr, yr+span)]
mean(c(diff(val), NA) / val, na.rm = TRUE)
})
}
noting that between
can be from either dplyr
or data.table
, they are equivalent in this use.
### data.table
library(data.table)
EX <- as.data.table(example)
EX[, avg5 := fun(markup, fyear), by = gvkey]
### (same result as below)
### dplyr
example %>%
group_by(gvkey) %>%
mutate(avg5 = fun(markup, fyear)) %>%
ungroup()
# # A tibble: 30 × 4
# gvkey fyear markup avg5
# <int> <int> <dbl> <dbl>
# 1 1001 1983 3.02 -0.00764
# 2 1001 1984 2.92 0.0202
# 3 1001 1985 2.98 NaN
# 4 1003 1983 0.629 -0.00480
# 5 1003 1984 0.538 0.0674
# 6 1003 1985 0.745 -0.0119
# 7 1003 1986 0.899 -0.0847
# 8 1003 1987 0.572 0.0550
# 9 1003 1988 0.510 0.217
# 10 1003 1989 0.621 NaN
# # … with 20 more rows
# # ℹ Use `print(n = ...)` to see more rows
Upvotes: 2