How to merge rows based on 2 datetime variables in R

Question

I would like to merge rows in the dataset df such that for each record ID, if the End_Dateof a row is equal to the Start_Date of the next row, then I would like to combine them.

The current logic that I'm thinking is to first group by ID then loop from row 1 to row n(). If End_Date of row 1 equals to Start_Date of row 2, replace End_Date of row 1 with the End_Date of row 2 and delete row 2. Then compare row 1 with the new row 2 and so on.

Really appreciate any help or suggestions on how to go about writing a code for this. Thank you!

df <- read.table(text = "
ID  Start_Date  End_Date
1   09/09/2016  11/09/2016
1   11/09/2016  17/09/2016
1   17/09/2016  23/11/2016
1   23/11/2016  28/11/2016
1   28/11/2016  14/12/2016
1   14/12/2016  22/12/2016
1   22/12/2016  23/12/2016
1   25/12/2016  03/01/2017
1   06/01/2017  07/02/2017
1   07/02/2017  22/02/2017
1   22/02/2017  27/02/2017
1   28/03/2017  30/05/2017
1   12/06/2017  03/07/2017
1   03/07/2017  04/07/2017
1   14/08/2017  14/08/2017
1   15/08/2017  23/08/2017
1   23/08/2017  24/08/2017
1   24/08/2017  28/08/2017
1   07/09/2017  07/09/2017
1   24/09/2017  24/09/2017
1   01/10/2017  08/11/2017
1   08/11/2017  31/12/2017
2   08/08/2017  10/08/2017
2   10/08/2017  11/08/2017", stringsAsFactors = FALSE, header = TRUE)

The output data should look like:

df_new <- read.table(text = "
ID  Start_Date  End_Date
1   09/09/2016  23/12/2016
1   25/12/2016  03/01/2017
1   06/01/2017  27/02/2017
1   28/03/2017  30/05/2017
1   12/06/2017  04/07/2017
1   14/08/2017  14/08/2017
1   15/08/2017  28/08/2017
1   07/09/2017  07/09/2017
1   24/09/2017  24/09/2017
1   01/10/2017  31/12/2017
2   08/08/2017  11/08/2017", stringsAsFactors = FALSE, header = TRUE)

Callum Webb · Accepted Answer

Here's a dplyr solution:

df %>%
  mutate(collapse = Start_Date == lag(End_Date),
         group = cumsum(c(0, tail(!collapse, -1)))) %>%
  group_by(group) %>%
  summarise(Start_Date = first(Start_Date),
            End_Date = last(End_Date)) %>%
  select(Start_Date, End_Date)

I encourage you to run this line by line to inspect what's going on!

First, it creates a new column collapse, which is true if the start date for a record is equal to the end date of the previous record. Because there is no previous record for the first row, you'll get an NA in row number 1.

Next, a new group column is created by taking the cumulative sum of the collapse column, but the NA on row one is trimmed and replaced with a 0. Collapse needs to be negated, because we only want the value to be incremented when we encounter a row that shouldn't be merged.

Finally the group column allows us to do a simple group by and summarise. I've ignored the ID column in this case.

(This also assumes your data are nicely ordered and you're not worried about overlapping date ranges).

How to merge rows based on 2 datetime variables in R

Answers (1)

Related Questions