Reputation: 940
Here is example data:
df = data.frame(id = (1:5),
type= c("a_type","a_type","b_type","b_type", "c_type"),
start_date= lubridate::dmy(c("01/01/2014", "30/04/2014
", "30/04/2015", "10/05/2015", "30/03/2016")),
fail_date = lubridate::dmy(c("30/04/2015", rep(NA,4))))
> df
id type start_date fail_date
1 1 a_type 2014-01-01 2015-04-30
2 1 a_type 2014-04-30 <NA>
3 1 b_type 2015-04-30 <NA>
4 1 b_type 2015-05-10 <NA>
5 1 c_type 2016-03-30 <NA>
I want to fill in fail_date
of "a_type" where they are NA
. This needs to be the start_date of the next type
that is "c_type" or "a_type". So output should be:
> df1
id type start_date fail_date
1 1 a_type 2014-01-01 2015-04-30
2 2 a_type 2014-04-30 2016-03-30
3 3 b_type 2015-04-30 <NA>
4 4 b_type 2015-05-10 <NA>
5 5 c_type 2016-03-30 <NA>
I have started with this code, but don't know how to specify the start_date
of the next entry that is type
"c_type" or "a_type":
df1= df%>%
mutate(fail_date = case_when(is.na(fail_date) & type =="a_type" ~ ??? ,
TRUE ~ fail_date))
I do not want to just filter out the "b_type"
#######Edit##############
Following the answer given by @Allan Cameron I have tried this on my data but have come across an issue.
If the last type
is "b-type" the code won't work, as the cumsum(rev(df$type) != 'b_type')
starts with a 0. Here is an example with amended data:
df = data.frame(id = (1:6),
type= c("a_type","a_type","b_type","b_type", "c_type","b_type"),
start_date= lubridate::dmy(c("01/12/2013","01/01/2014", "30/04/2014
", "30/04/2015", "10/05/2015", "30/03/2016")),
fail_date =rep(NA,6))
df1= df%>%
arrange(start_date) %>%
mutate(next_start = lead(rev(rev(start_date[type != 'b_type'])[
cumsum(rev(type) != 'b_type')])),
fail_date = if_else(type == 'a_type' &
is.na(fail_date) ,
next_start,
lubridate::dmy(fail_date)))
Error: Problem with `mutate()` column `next_start`.
i `next_start = lead(...)`.
i `next_start` must be size 6 or 1, not 5.
Here are some of the individual elements which have helped me to understand the error, but I dont' know how to overcome this:
df1= df%>%
mutate( rev = rev(type),
qa = rev(type) != 'b_type',
qw= cumsum(rev(type)!='b_type')
)
df1
id type start_date fail_date rev qa qw
1 1 a_type 2013-12-01 NA b_type FALSE 0
2 2 a_type 2014-01-01 NA c_type TRUE 1
3 3 b_type 2014-04-30 NA b_type FALSE 1
4 4 b_type 2015-04-30 NA b_type FALSE 1
5 5 c_type 2015-05-10 NA a_type TRUE 2
6 6 b_type 2016-03-30 NA a_type TRUE 3
Because the first entry in qw
is 0, rev(rev(start_date[type != 'b_type'])[ cumsum(rev(type) != 'b_type')])
does not produce a result.
Upvotes: 0
Views: 50
Reputation: 173813
This is tricky. You can do it all within the pipe, but the hard part is back-propagating the next available start date, which requires a combination of lead
, rev
, cumsum
and diff
to create a temporary column. Choosing whether to insert a value from the temporary column just comes down to specifying your logic inside an if_else
df %>%
arrange(start_date) %>%
mutate(next_start = lead(rev(rev(start_date[type != 'b_type'])[
cumsum(rev(type) != 'b_type')])),
fail_date = if_else(type == 'a_type' &
is.na(fail_date) &
lead(type, default = last(type)) != 'a_type',
next_start,
fail_date)) %>%
select(-next_start)
#> id type start_date fail_date
#> 1 1 a_type 2014-01-01 2015-04-30
#> 2 2 a_type 2014-04-30 2016-03-30
#> 3 3 b_type 2015-04-30 <NA>
#> 4 4 b_type 2015-05-10 <NA>
#> 5 5 c_type 2016-03-30 <NA>
If you have multiple types (rather than just 3) you might need to change the instances of type != 'b_type'
to type %in% allowed_types
, where allowed_types
is a pre-defined vector, as shown below:
allowed_types <- c('a_type', 'c_type')
df %>%
arrange(start_date) %>%
mutate(next_start = lead(rev(rev(start_date[type %in% allowed_types])[
cumsum(rev(type %in% allowed_types))])),
fail_date = if_else(type == 'a_type' &
is.na(fail_date) &
lead(type, default = last(type)) != 'a_type',
next_start,
fail_date)) %>%
select(-next_start)
This generates the same result in your example data but is more generalizable
Upvotes: 1