Reputation: 4636
Lets say I have three variables id
, date
, trad
(which has 3 values and can be anyone of them at any time point):
library(tidyverse)
dput(df)
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 2, 2, 2), date = structure(c(16436, 16437, 16438, 16439,
16440, 16441, 16442, 16443, 16444, 16445, 16446, 16447, 16448,
16449, 16450, 16451, 16452, 16453, 16454), class = "Date"), trad = c("Free",
"Suspended", "Suspended", "Free", "Suspended", "Withdrawn", "Withdrawn",
"Free", "Withdrawn", "Free", "Free", "Withdrawn", "Suspended",
"Withdrawn", "Withdrawn", "Free", "Withdrawn", "Suspended", "Free"
)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-19L), spec = structure(list(cols = list(id = structure(list(), class = c("collector_double",
"collector")), date = structure(list(format = "%d/%m/%Y"), class = c("collector_date",
"collector")), trad = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
df
# A tibble: 19 x 3
id date trad
<dbl> <date> <chr>
1 1 2015-01-01 Free
2 1 2015-01-02 Suspended
3 1 2015-01-03 Suspended
4 1 2015-01-04 Free
5 1 2015-01-05 Suspended
6 1 2015-01-06 Withdrawn
7 1 2015-01-07 Withdrawn
8 1 2015-01-08 Free
9 1 2015-01-09 Withdrawn
10 1 2015-01-10 Free
11 1 2015-01-11 Free
12 1 2015-01-12 Withdrawn
13 1 2015-01-13 Suspended
14 1 2015-01-14 Withdrawn
15 1 2015-01-15 Withdrawn
16 1 2015-01-16 Free
17 2 2015-01-17 Withdrawn
18 2 2015-01-18 Suspended
19 2 2015-01-19 Free
I would like to generate new columns with the start and end dates of when a period starts. A period starts when trad
moves to status "Withdrawn"
with the cavet that if there is a status "Suspended"
before the "Withdrawn"
row, the start date moves to this row. If there are multiple rows of "Suspended"
before "Withdrawn"
, then start begins with the first "Suspended"
. Similarly, the end date is when trad
goes to Free
after being in "Withdrawn"
. This is required final dataset:
dfnew
# A tibble: 19 x 6
id date trad start end period
<dbl> <date> <chr> <date> <date> <dbl>
1 1 2015-01-01 Free NA NA NA
2 1 2015-01-02 Suspended NA NA NA
3 1 2015-01-03 Suspended NA NA NA
4 1 2015-01-04 Free NA NA NA
5 1 2015-01-05 Suspended 2015-01-05 NA 1
6 1 2015-01-06 Withdrawn NA NA 1
7 1 2015-01-07 Withdrawn NA NA 1
8 1 2015-01-08 Free NA 2015-01-08 1
9 1 2015-01-09 Withdrawn 2015-01-09 NA 2
10 1 2015-01-10 Free NA 2015-01-10 2
11 1 2015-01-11 Free NA NA NA
12 1 2015-01-12 Withdrawn 2015-01-12 NA 3
13 1 2015-01-13 Suspended NA NA 3
14 1 2015-01-14 Withdrawn NA NA 3
15 1 2015-01-15 Withdrawn NA NA 3
16 1 2015-01-16 Free NA 2015-01-16 NA
17 2 2015-01-17 Withdrawn 2015-01-17 NA 1
18 2 2015-01-18 Suspended NA NA 1
19 2 2015-01-19 Free NA 2015-01-19 1
There is no pattern in trad
so you could have any sequence of "Withdrawn"
/"Suspended"
before "Free"
so a solution something like this doesn't work (in theory it could, but I would need too many conditions to implement it):
dfnew <- df %>%
group_by(id)
mutate(start = ifelse(trad == "Withdrawn" & lag(trad == "Free"), date, NA))
These questions are helpful but don't answer the question:
How to extract the previous n rows where a certain column value cannot be a particular value?
R - Conditional lagging - How to lag a certain amount of cells until a condition is met?
Would anyone have a flexible solution?
Upvotes: 0
Views: 267
Reputation: 293
Not very flexible, but at least a try.
I don't know what happens when we have sequence Suspended, Suspended, Withdrawn, Withdrawn.
For example change trad on 2015-01-04 to Suspended. When is the start date in this case? I gave 2 solutions, first makes start date on 2015-01-02 and the second on 2015-01-05
dfnew1 <- df %>%
mutate(startGroups = cumsum(trad == "Free")) %>%
group_by(startGroups) %>% # make a group from every occurance of "Free" in trad
mutate(wds = cumsum(trad == "Withdrawn"),
start = ifelse(max(wds) > 0 & row_number() == 2, date, NA) # if there is any "Withdrawn" in the group set start date right after "Free"
) %>%
ungroup() %>%
mutate(endGroups = cumsum(!is.na(start))) %>%
group_by(endGroups) %>% # group on every open trade now
mutate(frees = cumsum(trad == "Free"),
end = ifelse(frees == 1 & endGroups > 0, date, NA) #end on first occurance of "Free" in trad column
) # %>% select(-startGroups, wds, endGroups, frees) # remove cols
dfnew2 <- df %>%
mutate(startGroups = cumsum(trad == "Free")) %>%
group_by(startGroups) %>% # make a group from every occurance of "Free" in trad
mutate(wds = cumsum(trad == "Withdrawn"),
start = ifelse(
(trad == "Suspended" & lead(trad) == "Withdrawn" & lead(wds) == 1 |
trad == "Withdrawn" & lag(trad) != "Suspended" & wds == 1),
date, NA) # first trad in group. Other option:
) %>%
ungroup() %>%
mutate(endGroups = cumsum(!is.na(start))) %>%
group_by(endGroups) %>%
mutate(frees = cumsum(trad == "Free"),
end = ifelse(frees == 1 & endGroups > 0, date, NA)
) #%>% select(-startGroups, wds, endGroups, frees)
Upvotes: 1