Reputation: 27
I would like to merge rows in the dataset df
such that for each record ID, if the End_Date
of a row is equal to the Start_Date
of the next row, then I would like to combine them.
The current logic that I'm thinking is to first group by ID then loop from row 1 to row n(). If End_Date
of row 1 equals to Start_Date
of row 2, replace End_Date
of row 1 with the End_Date
of row 2 and delete row 2. Then compare row 1 with the new row 2 and so on.
Really appreciate any help or suggestions on how to go about writing a code for this. Thank you!
df <- read.table(text = "
ID Start_Date End_Date
1 09/09/2016 11/09/2016
1 11/09/2016 17/09/2016
1 17/09/2016 23/11/2016
1 23/11/2016 28/11/2016
1 28/11/2016 14/12/2016
1 14/12/2016 22/12/2016
1 22/12/2016 23/12/2016
1 25/12/2016 03/01/2017
1 06/01/2017 07/02/2017
1 07/02/2017 22/02/2017
1 22/02/2017 27/02/2017
1 28/03/2017 30/05/2017
1 12/06/2017 03/07/2017
1 03/07/2017 04/07/2017
1 14/08/2017 14/08/2017
1 15/08/2017 23/08/2017
1 23/08/2017 24/08/2017
1 24/08/2017 28/08/2017
1 07/09/2017 07/09/2017
1 24/09/2017 24/09/2017
1 01/10/2017 08/11/2017
1 08/11/2017 31/12/2017
2 08/08/2017 10/08/2017
2 10/08/2017 11/08/2017", stringsAsFactors = FALSE, header = TRUE)
The output data should look like:
df_new <- read.table(text = "
ID Start_Date End_Date
1 09/09/2016 23/12/2016
1 25/12/2016 03/01/2017
1 06/01/2017 27/02/2017
1 28/03/2017 30/05/2017
1 12/06/2017 04/07/2017
1 14/08/2017 14/08/2017
1 15/08/2017 28/08/2017
1 07/09/2017 07/09/2017
1 24/09/2017 24/09/2017
1 01/10/2017 31/12/2017
2 08/08/2017 11/08/2017", stringsAsFactors = FALSE, header = TRUE)
Upvotes: 2
Views: 364
Reputation: 354
Here's a dplyr solution:
df %>%
mutate(collapse = Start_Date == lag(End_Date),
group = cumsum(c(0, tail(!collapse, -1)))) %>%
group_by(group) %>%
summarise(Start_Date = first(Start_Date),
End_Date = last(End_Date)) %>%
select(Start_Date, End_Date)
I encourage you to run this line by line to inspect what's going on!
First, it creates a new column collapse, which is true if the start date for a record is equal to the end date of the previous record. Because there is no previous record for the first row, you'll get an NA in row number 1.
Next, a new group column is created by taking the cumulative sum of the collapse column, but the NA on row one is trimmed and replaced with a 0. Collapse needs to be negated, because we only want the value to be incremented when we encounter a row that shouldn't be merged.
Finally the group column allows us to do a simple group by and summarise. I've ignored the ID column in this case.
(This also assumes your data are nicely ordered and you're not worried about overlapping date ranges).
Upvotes: 1