Erika Sama
Erika Sama

Reputation: 27

How to merge rows based on 2 datetime variables in R

I would like to merge rows in the dataset df such that for each record ID, if the End_Dateof a row is equal to the Start_Date of the next row, then I would like to combine them.

The current logic that I'm thinking is to first group by ID then loop from row 1 to row n(). If End_Date of row 1 equals to Start_Date of row 2, replace End_Date of row 1 with the End_Date of row 2 and delete row 2. Then compare row 1 with the new row 2 and so on.

Really appreciate any help or suggestions on how to go about writing a code for this. Thank you!

df <- read.table(text = "
ID  Start_Date  End_Date
1   09/09/2016  11/09/2016
1   11/09/2016  17/09/2016
1   17/09/2016  23/11/2016
1   23/11/2016  28/11/2016
1   28/11/2016  14/12/2016
1   14/12/2016  22/12/2016
1   22/12/2016  23/12/2016
1   25/12/2016  03/01/2017
1   06/01/2017  07/02/2017
1   07/02/2017  22/02/2017
1   22/02/2017  27/02/2017
1   28/03/2017  30/05/2017
1   12/06/2017  03/07/2017
1   03/07/2017  04/07/2017
1   14/08/2017  14/08/2017
1   15/08/2017  23/08/2017
1   23/08/2017  24/08/2017
1   24/08/2017  28/08/2017
1   07/09/2017  07/09/2017
1   24/09/2017  24/09/2017
1   01/10/2017  08/11/2017
1   08/11/2017  31/12/2017
2   08/08/2017  10/08/2017
2   10/08/2017  11/08/2017", stringsAsFactors = FALSE, header = TRUE)

The output data should look like:

df_new <- read.table(text = "
ID  Start_Date  End_Date
1   09/09/2016  23/12/2016
1   25/12/2016  03/01/2017
1   06/01/2017  27/02/2017
1   28/03/2017  30/05/2017
1   12/06/2017  04/07/2017
1   14/08/2017  14/08/2017
1   15/08/2017  28/08/2017
1   07/09/2017  07/09/2017
1   24/09/2017  24/09/2017
1   01/10/2017  31/12/2017
2   08/08/2017  11/08/2017", stringsAsFactors = FALSE, header = TRUE)

Upvotes: 2

Views: 364

Answers (1)

Callum Webb
Callum Webb

Reputation: 354

Here's a dplyr solution:

df %>%
  mutate(collapse = Start_Date == lag(End_Date),
         group = cumsum(c(0, tail(!collapse, -1)))) %>%
  group_by(group) %>%
  summarise(Start_Date = first(Start_Date),
            End_Date = last(End_Date)) %>%
  select(Start_Date, End_Date)

I encourage you to run this line by line to inspect what's going on!

First, it creates a new column collapse, which is true if the start date for a record is equal to the end date of the previous record. Because there is no previous record for the first row, you'll get an NA in row number 1.

Next, a new group column is created by taking the cumulative sum of the collapse column, but the NA on row one is trimmed and replaced with a 0. Collapse needs to be negated, because we only want the value to be incremented when we encounter a row that shouldn't be merged.

Finally the group column allows us to do a simple group by and summarise. I've ignored the ID column in this case.

(This also assumes your data are nicely ordered and you're not worried about overlapping date ranges).

Upvotes: 1

Related Questions