Erwin Rhine
Erwin Rhine

Reputation: 303

How to create an indicator function based on existence of past occurrences?

I have a time-series panel dataset that is structured in the following way: There are multiple funds that each own multiple stocks and we have a value column for the stock. As you can see the panel is not balanced. My actual dataset is very large with each fund having at least 500 stocks and different quarters being represented with some having missing quarter values.

df <- data.frame(
  fund_id = c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2),
  stock_id = c(1,1,1,1,1,1,2,2,2,2,2,2,2,1,1,3,3,3,3),
  year_q = c("2011-03","2011-06","2011-09","2011-12","2012-03","2012-06","2011-12","2012-03","2012-06","2012-09",
           "2012-12","2013-03","2013-06","2014-09","2015-03","2013-03","2013-06","2013-09","2013-12"),
  value = c(1,2,1,3,4,2,1,2,3,4,2,1,3,1,1,3,2,3,1)
)


> df
   fund_id stock_id  year_q value
1        1        1 2011-03     1
2        1        1 2011-06     2
3        1        1 2011-09     1
4        1        1 2011-12     3
5        1        1 2012-03     4
6        1        1 2012-06     2
7        1        2 2011-12     1
8        1        2 2012-03     2
9        1        2 2012-06     3
10       1        2 2012-09     4
11       1        2 2012-12     2
12       1        2 2013-03     1
13       1        2 2013-06     3
14       2        1 2014-09     1
15       2        1 2015-03     1
16       2        3 2013-03     3
17       2        3 2013-06     2
18       2        3 2013-09     3
19       2        3 2013-12     1

I would like to create an indicator function that puts a True value if a stock was present in that quarter or any of the past 3 quarters within that fund. Here is the result that I'm looking for:

result <- data.frame(
  fund_id = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
  year_q = c("2011-03","2011-06","2011-09","2011-12","2012-03","2012-06","2012-09","2012-12","2013-03","2013-06",
             "2011-03","2011-06","2011-09","2011-12","2012-03","2012-06","2012-09","2012-12","2013-03","2013-06",
             "2013-03","2013-06","2013-09","2013-12","2014-03","2014-06","2014-09","2014-12","2015-03","2013-03",
             "2013-06","2013-09","2013-12","2014-03","2014-06","2014-09","2014-12","2015-03"),
  stock_id = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,3,3,3,3,3,3,3,3,3),
  Indicator = c(T,T,T,T,T,T,T,T,T,F,F,F,F,T,T,T,T,T,T,T,F,F,F,F,F,F,T,T,T,T,T,T,T,T,T,T,F,F)
)

   fund_id  year_q stock_id Indicator
1        1 2011-03        1      TRUE
2        1 2011-06        1      TRUE
3        1 2011-09        1      TRUE
4        1 2011-12        1      TRUE
5        1 2012-03        1      TRUE
6        1 2012-06        1      TRUE
7        1 2012-09        1      TRUE
8        1 2012-12        1      TRUE
9        1 2013-03        1      TRUE
10       1 2013-06        1     FALSE
11       1 2011-03        2     FALSE
12       1 2011-06        2     FALSE
13       1 2011-09        2     FALSE
14       1 2011-12        2      TRUE
15       1 2012-03        2      TRUE
16       1 2012-06        2      TRUE
17       1 2012-09        2      TRUE
18       1 2012-12        2      TRUE
19       1 2013-03        2      TRUE
20       1 2013-06        2      TRUE
21       2 2013-03        1     FALSE
22       2 2013-06        1     FALSE
23       2 2013-09        1     FALSE
24       2 2013-12        1     FALSE
25       2 2014-03        1     FALSE
26       2 2014-06        1     FALSE
27       2 2014-09        1      TRUE
28       2 2014-12        1      TRUE
29       2 2015-03        1      TRUE
30       2 2013-03        3      TRUE
31       2 2013-06        3      TRUE
32       2 2013-09        3      TRUE
33       2 2013-12        3      TRUE
34       2 2014-03        3      TRUE
35       2 2014-06        3      TRUE
36       2 2014-09        3      TRUE
37       2 2014-12        3     FALSE
38       2 2015-03        3     FALSE

Please note that in some cases the quarters are not successive and there might be a missing quarter. (if this is too difficult to deal with you can also ignore this condition)

Additionally, I would also like to create a value zero for a quarter if the stock that was present in any of the previous 3 quarters, no longer exists in that fund. (this is not that important though). I've been trying multiple loop solutions but since the data is massive it doesn't work that well. My ideal solution would be something using dplyr or datatable.

Upvotes: 0

Views: 315

Answers (1)

r2evans
r2evans

Reputation: 160677

I'll demonstrate using zoo::rollapplyr for the "this and previous 3 quarters" (4-wide window) in a dplyr pipe. We first need to "fill out" the quarters, which I'll do by summarizing, converting to Date, then sequence by 3 months, merge back in to df, and then roll the calcs.

Note, compared to your result, in some cases my "forecasting" goes further into the future. I don't know if this is an inconsistency or just an indication of you estimating what you wanted result to look like.

Up front, the full code and results, then I'll walk through it.

library(dplyr)
# library(purrr) # map2
# library(tidyr) # unnest
out <- df %>%
  group_by(fund_id) %>%
  mutate(miny = min(year_q), maxy = max(year_q)) %>%
  distinct(fund_id, stock_id, miny, maxy) %>%
  group_by(fund_id, stock_id) %>%
  mutate(across(c(miny, maxy), ~ as.Date(paste0(., "-01")))) %>%
  transmute(year_q = purrr::map2(miny, maxy, ~ format(seq(.x, .y, by = "3 months"), format = "%Y-%m")))  %>%
  tidyr::unnest(year_q) %>%
  full_join(df, by = c("fund_id", "stock_id", "year_q")) %>%
  arrange(fund_id, stock_id, year_q) %>% # only 'year_q' is strictly required, other 2 are aesthetic
  mutate(
    Indicator2 = zoo::rollapplyr(value, 4, FUN = function(z) any(!is.na(z)), partial = TRUE)
  ) %>%
  ungroup() %>%
  mutate(value = coalesce(value, 0))

out
# # A tibble: 38 x 5
#    fund_id stock_id year_q  value Indicator2
#      <dbl>    <dbl> <chr>   <dbl> <lgl>     
#  1       1        1 2011-03     1 TRUE      
#  2       1        1 2011-06     2 TRUE      
#  3       1        1 2011-09     1 TRUE      
#  4       1        1 2011-12     3 TRUE      
#  5       1        1 2012-03     4 TRUE      
#  6       1        1 2012-06     2 TRUE      
#  7       1        1 2012-09     0 TRUE      
#  8       1        1 2012-12     0 TRUE      
#  9       1        1 2013-03     0 TRUE      
# 10       1        1 2013-06     0 FALSE     
# # ... with 28 more rows

Quick validation, works conveniently because the order is the same:

with(full_join(out, result, by = c("fund_id", "stock_id", "year_q")),
     identical(Indicator, Indicator2))
# [1] TRUE

Walk-through:

  1. Find the min/max year_q for each fund_id:

    df %>%
      group_by(fund_id) %>%
      mutate(miny = min(year_q), maxy = max(year_q)) %>%
      distinct(fund_id, stock_id, miny, maxy)
    # # A tibble: 4 x 4
    # # Groups:   fund_id [2]
    #   fund_id stock_id miny    maxy   
    #     <dbl>    <dbl> <chr>   <chr>  
    # 1       1        1 2011-03 2013-06
    # 2       1        2 2011-03 2013-06
    # 3       2        1 2013-03 2015-03
    # 4       2        3 2013-03 2015-03
    
  2. "Fill" that out so that each stock_id is covered for the full time-frame of the fund_id:

    ... %>%
      group_by(fund_id, stock_id) %>%
      mutate(across(c(miny, maxy), ~ as.Date(paste0(., "-01")))) %>%
      transmute(year_q = purrr::map2(miny, maxy, ~ format(seq(.x, .y, by = "3 months"), format = "%Y-%m")))  %>%
      tidyr::unnest(year_q)
    # # A tibble: 38 x 3
    # # Groups:   fund_id, stock_id [4]
    #    fund_id stock_id year_q 
    #      <dbl>    <dbl> <chr>  
    #  1       1        1 2011-03
    #  2       1        1 2011-06
    #  3       1        1 2011-09
    #  4       1        1 2011-12
    #  5       1        1 2012-03
    #  6       1        1 2012-06
    #  7       1        1 2012-09
    #  8       1        1 2012-12
    #  9       1        1 2013-03
    # 10       1        1 2013-06
    # # ... with 28 more rows
    
  3. Join the original data back in. This shows us quarters that were missing.

    ... %>%
      full_join(df, by = c("fund_id", "stock_id", "year_q")) %>%
      arrange(fund_id, stock_id, year_q)
    # # A tibble: 38 x 4
    # # Groups:   fund_id, stock_id [4]
    #    fund_id stock_id year_q  value
    #      <dbl>    <dbl> <chr>   <dbl>
    #  1       1        1 2011-03     1
    #  2       1        1 2011-06     2
    #  3       1        1 2011-09     1
    #  4       1        1 2011-12     3
    #  5       1        1 2012-03     4
    #  6       1        1 2012-06     2
    #  7       1        1 2012-09    NA
    #  8       1        1 2012-12    NA
    #  9       1        1 2013-03    NA
    # 10       1        1 2013-06    NA
    # # ... with 28 more rows
    
  4. Perform the "rolling" calculation. This is the majority of "work": for each value, we look at it and the previous 3 values (4 minus 1 for the current gives us 3 previous values) and determine if any are non-NA. partial=TRUE means that for the first value (that has no previous), we look at it alone; for the second value, we look at it and the 1 previous; etc. With partial=TRUE, this returns the same length as value; if partial=FALSE (the default), then it would return length(value) - (4-1) values, which is not what we want.

    ... %>%
      # only 'year_q' is strictly required, other 2 are aesthetic
      mutate(
        Indicator2 = zoo::rollapplyr(value, 4, FUN = function(z) any(!is.na(z)), partial = TRUE)
      )
    # # A tibble: 38 x 5
    # # Groups:   fund_id, stock_id [4]
    #    fund_id stock_id year_q  value Indicator2
    #      <dbl>    <dbl> <chr>   <dbl> <lgl>     
    #  1       1        1 2011-03     1 TRUE      
    #  2       1        1 2011-06     2 TRUE      
    #  3       1        1 2011-09     1 TRUE      
    #  4       1        1 2011-12     3 TRUE      
    #  5       1        1 2012-03     4 TRUE      
    #  6       1        1 2012-06     2 TRUE      
    #  7       1        1 2012-09    NA TRUE      
    #  8       1        1 2012-12    NA TRUE      
    #  9       1        1 2013-03    NA TRUE      
    # 10       1        1 2013-06    NA FALSE     
    # # ... with 28 more rows
    
  5. Finally, arrange (for convenience only) and convert all is.na(value) to 0.


data.table

library(data.table)
DT <- as.data.table(df)
year_qs <- distinct(
  DT[, c("miny", "maxy") := .(min(year_q), max(year_q)), by = .(fund_id)
     ][, .(fund_id, stock_id, miny, maxy)]
)[, c("miny","maxy") := lapply(.SD, function(z) as.Date(paste0(z, "-01"))), .SDcols = c("miny","maxy")
  ][, .(year_q = Map(function(a, b) format(seq.Date(a, b, by = "3 months"), format = "%Y-%m"), miny, maxy)),
    by = .(fund_id, stock_id)
    ][, tidyr::unnest(.SD, year_q) ]
setDT(year_qs)
DT[, c("miny", "maxy") := NULL]

DT <- rbindlist(list(
  DT,
  year_qs[!DT, on = .(fund_id, stock_id, year_q)]),
  use.names = TRUE, fill = TRUE)
setorder(DT, fund_id, stock_id, year_q)

DT[, Indicator := zoo::rollapplyr(value, 4, FUN = function(z) any(!is.na(z)), partial = TRUE),
   by = .(fund_id, stock_id)
   ][, value := fcoalesce(value, 0)]

Upvotes: 2

Related Questions