user63230
user63230

Reputation: 4636

R new variable by group looped on multiple lagged and lead values

Lets say I have three variables id, date, trad (which has 3 values and can be anyone of them at any time point):

library(tidyverse) 
dput(df)
    structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 2, 2, 2), date = structure(c(16436, 16437, 16438, 16439, 
    16440, 16441, 16442, 16443, 16444, 16445, 16446, 16447, 16448, 
    16449, 16450, 16451, 16452, 16453, 16454), class = "Date"), trad = c("Free", 
    "Suspended", "Suspended", "Free", "Suspended", "Withdrawn", "Withdrawn", 
    "Free", "Withdrawn", "Free", "Free", "Withdrawn", "Suspended", 
    "Withdrawn", "Withdrawn", "Free", "Withdrawn", "Suspended", "Free"
    )), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, 
    -19L), spec = structure(list(cols = list(id = structure(list(), class = c("collector_double", 
    "collector")), date = structure(list(format = "%d/%m/%Y"), class = c("collector_date", 
    "collector")), trad = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))
    df
    # A tibble: 19 x 3
          id date       trad     
       <dbl> <date>     <chr>    
     1     1 2015-01-01 Free     
     2     1 2015-01-02 Suspended
     3     1 2015-01-03 Suspended
     4     1 2015-01-04 Free     
     5     1 2015-01-05 Suspended
     6     1 2015-01-06 Withdrawn
     7     1 2015-01-07 Withdrawn
     8     1 2015-01-08 Free     
     9     1 2015-01-09 Withdrawn
    10     1 2015-01-10 Free     
    11     1 2015-01-11 Free     
    12     1 2015-01-12 Withdrawn
    13     1 2015-01-13 Suspended
    14     1 2015-01-14 Withdrawn
    15     1 2015-01-15 Withdrawn
    16     1 2015-01-16 Free     
    17     2 2015-01-17 Withdrawn
    18     2 2015-01-18 Suspended
    19     2 2015-01-19 Free 

I would like to generate new columns with the start and end dates of when a period starts. A period starts when trad moves to status "Withdrawn" with the cavet that if there is a status "Suspended" before the "Withdrawn" row, the start date moves to this row. If there are multiple rows of "Suspended" before "Withdrawn", then start begins with the first "Suspended". Similarly, the end date is when trad goes to Free after being in "Withdrawn". This is required final dataset:

dfnew
# A tibble: 19 x 6
      id date       trad      start      end        period
   <dbl> <date>     <chr>     <date>     <date>      <dbl>
 1     1 2015-01-01 Free      NA         NA             NA
 2     1 2015-01-02 Suspended NA         NA             NA
 3     1 2015-01-03 Suspended NA         NA             NA
 4     1 2015-01-04 Free      NA         NA             NA
 5     1 2015-01-05 Suspended 2015-01-05 NA              1
 6     1 2015-01-06 Withdrawn NA         NA              1
 7     1 2015-01-07 Withdrawn NA         NA              1
 8     1 2015-01-08 Free      NA         2015-01-08      1
 9     1 2015-01-09 Withdrawn 2015-01-09 NA              2
10     1 2015-01-10 Free      NA         2015-01-10      2
11     1 2015-01-11 Free      NA         NA             NA
12     1 2015-01-12 Withdrawn 2015-01-12 NA              3
13     1 2015-01-13 Suspended NA         NA              3
14     1 2015-01-14 Withdrawn NA         NA              3
15     1 2015-01-15 Withdrawn NA         NA              3
16     1 2015-01-16 Free      NA         2015-01-16     NA
17     2 2015-01-17 Withdrawn 2015-01-17 NA              1
18     2 2015-01-18 Suspended NA         NA              1
19     2 2015-01-19 Free      NA         2015-01-19      1 

There is no pattern in trad so you could have any sequence of "Withdrawn"/"Suspended" before "Free" so a solution something like this doesn't work (in theory it could, but I would need too many conditions to implement it):

dfnew <- df %>% 
  group_by(id)
  mutate(start = ifelse(trad == "Withdrawn" & lag(trad == "Free"), date, NA))

These questions are helpful but don't answer the question:

How to extract the previous n rows where a certain column value cannot be a particular value?

R - Conditional lagging - How to lag a certain amount of cells until a condition is met?

Would anyone have a flexible solution?

Upvotes: 0

Views: 267

Answers (1)

h1427096
h1427096

Reputation: 293

Not very flexible, but at least a try.

I don't know what happens when we have sequence Suspended, Suspended, Withdrawn, Withdrawn.

For example change trad on 2015-01-04 to Suspended. When is the start date in this case? I gave 2 solutions, first makes start date on 2015-01-02 and the second on 2015-01-05

dfnew1 <- df %>% 
    mutate(startGroups = cumsum(trad == "Free")) %>% 
    group_by(startGroups) %>% # make a group from every occurance of "Free" in trad
    mutate(wds = cumsum(trad == "Withdrawn"),
           start = ifelse(max(wds) > 0 & row_number() == 2, date, NA) # if there is any "Withdrawn" in the group set start date right after "Free" 
           ) %>% 
    ungroup() %>% 
    mutate(endGroups = cumsum(!is.na(start))) %>% 
    group_by(endGroups) %>% # group on every open trade now
    mutate(frees = cumsum(trad == "Free"),
           end = ifelse(frees == 1 & endGroups > 0, date, NA) #end on first occurance of "Free" in trad column
           ) # %>% select(-startGroups, wds, endGroups, frees) # remove cols

dfnew2 <- df %>% 
    mutate(startGroups = cumsum(trad == "Free")) %>% 
    group_by(startGroups) %>% # make a group from every occurance of "Free" in trad
    mutate(wds = cumsum(trad == "Withdrawn"),
           start = ifelse(
                        (trad == "Suspended" & lead(trad) == "Withdrawn" & lead(wds) == 1 |
                            trad == "Withdrawn" & lag(trad) != "Suspended" & wds == 1), 
                       date, NA) # first trad in group. Other option: 
    ) %>% 
    ungroup() %>% 
    mutate(endGroups = cumsum(!is.na(start))) %>% 
    group_by(endGroups) %>% 
    mutate(frees = cumsum(trad == "Free"),
           end = ifelse(frees == 1 & endGroups > 0, date, NA)
    )  #%>% select(-startGroups, wds, endGroups, frees)

Upvotes: 1

Related Questions