user33484
user33484

Reputation: 568

R - pick up a value from row after and add as new column in a dataframe

> df <- data.frame('unique_ref' = c("a_2016","a_2017","a_2017","a_2016","a_2016"),
+                  'trans_type' = c('NB','NB','CANC','MTA','MTA'),
+                  'incept_dt' = c('01/01/2016','01/01/2017','01/01/2017','01/01/2016','01/01/2016'),
+                  'exp_dt' = c('31/12/2016','31/12/2017','31/12/2017','31/12/2016','31/12/2016'),
+                  'trans_dt' = c('01/01/2016','01/01/2017','01/03/2017','01/07/2016','01/09/2016'))
> df
  unique_ref trans_type  incept_dt     exp_dt   trans_dt
1     a_2016         NB 01/01/2016 31/12/2016 01/01/2016
2     a_2017         NB 01/01/2017 31/12/2017 01/01/2017
3     a_2017       CANC 01/01/2017 31/12/2017 01/03/2017
4     a_2016        MTA 01/01/2016 31/12/2016 01/07/2016
5     a_2016        MTA 01/01/2016 31/12/2016 01/09/2016

I have the above dataset format which has a unique_ref and a few dates. I want to be able to sort this dataset by the unique ref and incept_dt and trans_dt:

> df %>% arrange(unique_ref,incept_dt,trans_dt)
  unique_ref trans_type  incept_dt     exp_dt   trans_dt
1     a_2016         NB 01/01/2016 31/12/2016 01/01/2016
2     a_2016        MTA 01/01/2016 31/12/2016 01/07/2016
3     a_2016        MTA 01/01/2016 31/12/2016 01/09/2016
4     a_2017         NB 01/01/2017 31/12/2017 01/01/2017
5     a_2017       CANC 01/01/2017 31/12/2017 01/03/2017

Now with this sorted dataset I want to create a new column called trans_end_dt which looks at the row below and picks up that rows trans_dt less 1 day. It should do this for every unique_ref, but stop once it has reached the final unique_ref in that grouping and pick up the exp_dt. I.e. the result should be:

> df %>% arrange(unique_ref,incept_dt,trans_dt)
  unique_ref trans_type  incept_dt     exp_dt   trans_dt trans_end_dt
1     a_2016         NB 01/01/2016 31/12/2016 01/01/2016 30/06/2016 #this is 01/07/2016 minus one day
2     a_2016        MTA 01/01/2016 31/12/2016 01/07/2016 31/08/2016 #same logic as above
3     a_2016        MTA 01/01/2016 31/12/2016 01/09/2016 31/12/2016 #next row is a new unique_ref so the value should just be the exp_dt which is 31/12/2016
4     a_2017         NB 01/01/2017 31/12/2017 01/01/2017 28/02/2017
5     a_2017       CANC 01/01/2017 31/12/2017 01/03/2017 31/12/2017

Does anyone know how I can do this? Preferably using dplyr but I am struggling to get this to work so any solution would be great

Upvotes: 0

Views: 161

Answers (2)

Ben
Ben

Reputation: 30474

My approach was to:

  • Change columns ending in dt to Date format
  • Sort as previously done by unique_ref,incept_dt,trans_dt
  • For each unique_ref (using group_by) check if last row in group; if it is, use exp_dt, and if not, use lead(trans_dt) to get the following row data element and subtract 1

    library(dplyr)
    
    df %>% 
      mutate_at(vars(ends_with("_dt")), as.Date, format = "%d/%m/%Y") %>%
      arrange(unique_ref,incept_dt,trans_dt) %>%
      group_by(unique_ref) %>%
      mutate(trans_end_dt = if_else(row_number() == n(), exp_dt,  lead(trans_dt) - 1))
    
      unique_ref trans_type incept_dt  exp_dt     trans_dt   trans_end_dt
      <fct>      <fct>      <date>     <date>     <date>     <date>      
    1 a_2016     NB         2016-01-01 2016-12-31 2016-01-01 2016-06-30  
    2 a_2016     MTA        2016-01-01 2016-12-31 2016-07-01 2016-08-31  
    3 a_2016     MTA        2016-01-01 2016-12-31 2016-09-01 2016-12-31  
    4 a_2017     NB         2017-01-01 2017-12-31 2017-01-01 2017-02-28  
    5 a_2017     CANC       2017-01-01 2017-12-31 2017-03-01 2017-12-31 
    

Edit: As suggested by @Gregor, you can alternatively replace if/else with coalesce (which will fill in NA values):

mutate(trans_end_dt = coalesce(lead(trans_dt) - 1, exp_dt))

Edit: The group_by statement is needed so that the following operations after the %>% pipe are performed "by group." Even though the data has been sorted via arrange, the following mutate statement would consider all of the rows at one time, instead of subgroups a_2016, a_2017, etc. individually.

This is necessary for row_number() == n() where it compares the row number evaluated within the group to the total number of rows in the group (see Select first and last row from grouped data regarding selecting last row of grouped data).

For example, when unique_ref is grouped, there are 3 rows (n() = 3) for a_2016, so if row_number() is 3, essentially the third line in the a_2016 group (or last in the group), then use exp_dt.

Upvotes: 1

Onyambu
Onyambu

Reputation: 79228

You can use lead with the default being the last day of exp_dt

library(tidyverse)
library(lubridate)

df %>% 
  arrange(unique_ref,incept_dt,trans_dt)%>%
  group_by(unique_ref)%>%
  mutate(trans_end_dt = strftime(lead(dmy(trans_dt)-1,1,dmy(exp_dt[n()])),"%d/%m/%Y"))#or format instead of strptime

# A tibble: 5 x 6
# Groups:   unique_ref [2]
  unique_ref trans_type incept_dt  exp_dt     trans_dt   trans_end_dt
  <fct>      <fct>      <fct>      <fct>      <fct>      <chr>       
1 a_2016     NB         01/01/2016 31/12/2016 01/01/2016 30/06/2016  
2 a_2016     MTA        01/01/2016 31/12/2016 01/07/2016 31/08/2016  
3 a_2016     MTA        01/01/2016 31/12/2016 01/09/2016 31/12/2016  
4 a_2017     NB         01/01/2017 31/12/2017 01/01/2017 28/02/2017  
5 a_2017     CANC       01/01/2017 31/12/2017 01/03/2017 31/12/2017 

Upvotes: 0

Related Questions