Mark Davies
Mark Davies

Reputation: 940

Filling NA data with data from another row with matching

Here is example data:

df = data.frame(id = (1:5),
                type= c("a_type","a_type","b_type","b_type", "c_type"),
                start_date= lubridate::dmy(c("01/01/2014", "30/04/2014
", "30/04/2015", "10/05/2015", "30/03/2016")),
                fail_date = lubridate::dmy(c("30/04/2015", rep(NA,4))))
> df
  id   type start_date  fail_date
1  1 a_type 2014-01-01 2015-04-30
2  1 a_type 2014-04-30       <NA>
3  1 b_type 2015-04-30       <NA>
4  1 b_type 2015-05-10       <NA>
5  1 c_type 2016-03-30       <NA>

I want to fill in fail_date of "a_type" where they are NA. This needs to be the start_date of the next type that is "c_type" or "a_type". So output should be:

> df1
  id   type start_date  fail_date
1  1 a_type 2014-01-01 2015-04-30
2  2 a_type 2014-04-30 2016-03-30
3  3 b_type 2015-04-30       <NA>
4  4 b_type 2015-05-10       <NA>
5  5 c_type 2016-03-30       <NA>

I have started with this code, but don't know how to specify the start_date of the next entry that is type "c_type" or "a_type":

df1= df%>%
  mutate(fail_date = case_when(is.na(fail_date) & type =="a_type" ~ ??? ,
                               TRUE ~ fail_date))

I do not want to just filter out the "b_type"

#######Edit##############

Following the answer given by @Allan Cameron I have tried this on my data but have come across an issue.

If the last type is "b-type" the code won't work, as the cumsum(rev(df$type) != 'b_type') starts with a 0. Here is an example with amended data:

df = data.frame(id = (1:6),
                type= c("a_type","a_type","b_type","b_type", "c_type","b_type"),
                start_date= lubridate::dmy(c("01/12/2013","01/01/2014", "30/04/2014
", "30/04/2015", "10/05/2015", "30/03/2016")),
                fail_date =rep(NA,6))

df1= df%>%
  arrange(start_date) %>%
  mutate(next_start = lead(rev(rev(start_date[type != 'b_type'])[
    cumsum(rev(type) != 'b_type')])),
    fail_date = if_else(type == 'a_type' & 
                          is.na(fail_date) ,
                        next_start, 
                        lubridate::dmy(fail_date)))

Error: Problem with `mutate()` column `next_start`.
i `next_start = lead(...)`.
i `next_start` must be size 6 or 1, not 5.

Here are some of the individual elements which have helped me to understand the error, but I dont' know how to overcome this:

df1= df%>%
  mutate(         rev =  rev(type),
         qa = rev(type) != 'b_type',
         qw= cumsum(rev(type)!='b_type')
         )

df1
  id   type start_date fail_date    rev    qa qw
1  1 a_type 2013-12-01        NA b_type FALSE  0
2  2 a_type 2014-01-01        NA c_type  TRUE  1
3  3 b_type 2014-04-30        NA b_type FALSE  1
4  4 b_type 2015-04-30        NA b_type FALSE  1
5  5 c_type 2015-05-10        NA a_type  TRUE  2
6  6 b_type 2016-03-30        NA a_type  TRUE  3

Because the first entry in qw is 0, rev(rev(start_date[type != 'b_type'])[ cumsum(rev(type) != 'b_type')]) does not produce a result.

Upvotes: 0

Views: 50

Answers (1)

Allan Cameron
Allan Cameron

Reputation: 173813

This is tricky. You can do it all within the pipe, but the hard part is back-propagating the next available start date, which requires a combination of lead, rev, cumsum and diff to create a temporary column. Choosing whether to insert a value from the temporary column just comes down to specifying your logic inside an if_else

df %>%
  arrange(start_date) %>%
  mutate(next_start = lead(rev(rev(start_date[type != 'b_type'])[
                      cumsum(rev(type) != 'b_type')])),
         fail_date = if_else(type == 'a_type' & 
                             is.na(fail_date) &
                             lead(type, default = last(type)) != 'a_type',
                             next_start, 
                             fail_date)) %>%
  select(-next_start)
#>   id   type start_date  fail_date
#> 1  1 a_type 2014-01-01 2015-04-30
#> 2  2 a_type 2014-04-30 2016-03-30
#> 3  3 b_type 2015-04-30       <NA>
#> 4  4 b_type 2015-05-10       <NA>
#> 5  5 c_type 2016-03-30       <NA>

If you have multiple types (rather than just 3) you might need to change the instances of type != 'b_type' to type %in% allowed_types, where allowed_types is a pre-defined vector, as shown below:

allowed_types <- c('a_type', 'c_type')

df %>%
  arrange(start_date) %>%
  mutate(next_start = lead(rev(rev(start_date[type %in% allowed_types])[
                      cumsum(rev(type %in% allowed_types))])),
         fail_date = if_else(type == 'a_type' & 
                             is.na(fail_date) &
                             lead(type, default = last(type)) != 'a_type',
                             next_start, 
                             fail_date)) %>%
  select(-next_start)

This generates the same result in your example data but is more generalizable

Upvotes: 1

Related Questions