Reputation: 568
> df <- data.frame('unique_ref' = c("a_2016","a_2017","a_2017","a_2016","a_2016"),
+ 'trans_type' = c('NB','NB','CANC','MTA','MTA'),
+ 'incept_dt' = c('01/01/2016','01/01/2017','01/01/2017','01/01/2016','01/01/2016'),
+ 'exp_dt' = c('31/12/2016','31/12/2017','31/12/2017','31/12/2016','31/12/2016'),
+ 'trans_dt' = c('01/01/2016','01/01/2017','01/03/2017','01/07/2016','01/09/2016'))
> df
unique_ref trans_type incept_dt exp_dt trans_dt
1 a_2016 NB 01/01/2016 31/12/2016 01/01/2016
2 a_2017 NB 01/01/2017 31/12/2017 01/01/2017
3 a_2017 CANC 01/01/2017 31/12/2017 01/03/2017
4 a_2016 MTA 01/01/2016 31/12/2016 01/07/2016
5 a_2016 MTA 01/01/2016 31/12/2016 01/09/2016
I have the above dataset format which has a unique_ref and a few dates. I want to be able to sort this dataset by the unique ref and incept_dt and trans_dt:
> df %>% arrange(unique_ref,incept_dt,trans_dt)
unique_ref trans_type incept_dt exp_dt trans_dt
1 a_2016 NB 01/01/2016 31/12/2016 01/01/2016
2 a_2016 MTA 01/01/2016 31/12/2016 01/07/2016
3 a_2016 MTA 01/01/2016 31/12/2016 01/09/2016
4 a_2017 NB 01/01/2017 31/12/2017 01/01/2017
5 a_2017 CANC 01/01/2017 31/12/2017 01/03/2017
Now with this sorted dataset I want to create a new column called trans_end_dt which looks at the row below and picks up that rows trans_dt less 1 day. It should do this for every unique_ref, but stop once it has reached the final unique_ref in that grouping and pick up the exp_dt. I.e. the result should be:
> df %>% arrange(unique_ref,incept_dt,trans_dt)
unique_ref trans_type incept_dt exp_dt trans_dt trans_end_dt
1 a_2016 NB 01/01/2016 31/12/2016 01/01/2016 30/06/2016 #this is 01/07/2016 minus one day
2 a_2016 MTA 01/01/2016 31/12/2016 01/07/2016 31/08/2016 #same logic as above
3 a_2016 MTA 01/01/2016 31/12/2016 01/09/2016 31/12/2016 #next row is a new unique_ref so the value should just be the exp_dt which is 31/12/2016
4 a_2017 NB 01/01/2017 31/12/2017 01/01/2017 28/02/2017
5 a_2017 CANC 01/01/2017 31/12/2017 01/03/2017 31/12/2017
Does anyone know how I can do this? Preferably using dplyr but I am struggling to get this to work so any solution would be great
Upvotes: 0
Views: 161
Reputation: 30474
My approach was to:
Date
formatunique_ref,incept_dt,trans_dt
For each unique_ref (using group_by
) check if last row in group; if it is, use exp_dt
, and if not, use lead(trans_dt)
to get the following row data element and subtract 1
library(dplyr)
df %>%
mutate_at(vars(ends_with("_dt")), as.Date, format = "%d/%m/%Y") %>%
arrange(unique_ref,incept_dt,trans_dt) %>%
group_by(unique_ref) %>%
mutate(trans_end_dt = if_else(row_number() == n(), exp_dt, lead(trans_dt) - 1))
unique_ref trans_type incept_dt exp_dt trans_dt trans_end_dt
<fct> <fct> <date> <date> <date> <date>
1 a_2016 NB 2016-01-01 2016-12-31 2016-01-01 2016-06-30
2 a_2016 MTA 2016-01-01 2016-12-31 2016-07-01 2016-08-31
3 a_2016 MTA 2016-01-01 2016-12-31 2016-09-01 2016-12-31
4 a_2017 NB 2017-01-01 2017-12-31 2017-01-01 2017-02-28
5 a_2017 CANC 2017-01-01 2017-12-31 2017-03-01 2017-12-31
Edit: As suggested by @Gregor, you can alternatively replace if/else
with coalesce
(which will fill in NA
values):
mutate(trans_end_dt = coalesce(lead(trans_dt) - 1, exp_dt))
Edit: The group_by
statement is needed so that the following operations after the %>%
pipe are performed "by group." Even though the data has been sorted via arrange
, the following mutate statement would consider all of the rows at one time, instead of subgroups a_2016, a_2017, etc. individually.
This is necessary for row_number() == n()
where it compares the row number evaluated within the group to the total number of rows in the group (see Select first and last row from grouped data regarding selecting last row of grouped data).
For example, when unique_ref
is grouped, there are 3 rows (n() = 3
) for a_2016
, so if row_number()
is 3, essentially the third line in the a_2016 group (or last in the group), then use exp_dt
.
Upvotes: 1
Reputation: 79228
You can use lead
with the default being the last day of exp_dt
library(tidyverse)
library(lubridate)
df %>%
arrange(unique_ref,incept_dt,trans_dt)%>%
group_by(unique_ref)%>%
mutate(trans_end_dt = strftime(lead(dmy(trans_dt)-1,1,dmy(exp_dt[n()])),"%d/%m/%Y"))#or format instead of strptime
# A tibble: 5 x 6
# Groups: unique_ref [2]
unique_ref trans_type incept_dt exp_dt trans_dt trans_end_dt
<fct> <fct> <fct> <fct> <fct> <chr>
1 a_2016 NB 01/01/2016 31/12/2016 01/01/2016 30/06/2016
2 a_2016 MTA 01/01/2016 31/12/2016 01/07/2016 31/08/2016
3 a_2016 MTA 01/01/2016 31/12/2016 01/09/2016 31/12/2016
4 a_2017 NB 01/01/2017 31/12/2017 01/01/2017 28/02/2017
5 a_2017 CANC 01/01/2017 31/12/2017 01/03/2017 31/12/2017
Upvotes: 0