Ian Parzyszek
Ian Parzyszek

Reputation: 11

A way to combine start and stop date if there is overlap or if there is less than a 5 day gap

I am dealing with individuals who received a particular treatment and I want to know the time ranges they were treated, or within 5 days of ending a treatment and starting another.

r <- read.table(text="
   ID   Start_Date     End_Date
    1     05-06-18       05-10-18
    1     05-08-18       05-14-18  
    1     05-16-18       05-25-18
    1     06-28-19       07-02-19
    1     07-03-19       07-08-19
    2     04-20-18       04-23-18
    2     07-20/18       07-25-18 
    2     07-26-18       07-30-18 
    3     05-14-17       05-29-17", 
                stringsAsFactors=FALSE, header=TRUE)

This is what I had, and this is what I want:

r <- read.table(text="
   ID   Start_Date     End_Date
    1     05-06-18     05-25-18
    1     06-28-19     07-08-19
    2     04-20-18     07-30-18 
    3     05-14-17     05-29-17", 
                stringsAsFactors=FALSE, header=TRUE)

This is the code I have tried, but doesn't quite get what I am after.

Upvotes: 0

Views: 37

Answers (2)

L Tyrone
L Tyrone

Reputation: 6885

Based on your data, I cannot determine how to achieve your example output based on your description. That's because the dates for ID == 2 have a gap > 5 days.

However, applying the logic that creates your desired outcome for the other ID values, I have assumed that a treatment period is grouped if the gap between the end of previous treatment continues to be <= 5 days. For example, if a sequence of five treatments are 1, 3, 5, 4, and 5 days apart, this 18 day period should be grouped as one row.

The workflow:

  • return the number of days between the start of one treatment and the end of the previous treatment
  • set the first row of every ID, and every gap > 5 days, to 0
  • create a column identifying the start and end of each treatment period
  • return the minimum Start_Date and maximum End_Date for each ID treatment period

The result returns two rows for ID == 2, which I believe is consistent with your explanation. If not, please update your question.

library(lubridate)
library(dplyr)

r <- read.table(text = "
   ID   Start_Date     End_Date
    1     05-06-18       05-10-18
    1     05-08-18       05-14-18  
    1     05-16-18       05-25-18
    1     06-28-19       07-02-19
    1     07-03-19       07-08-19
    2     04-20-18       04-23-18
    2     07-20-18       07-25-18 
    2     07-26-18       07-30-18 
    3     05-14-17       05-29-17",
                stringsAsFactors = FALSE, header = TRUE)

# Convert date strings to date class columns
r$Start_Date <- mdy(r$Start_Date)
r$End_Date <- mdy(r$End_Date)

# Determine start and end dates for treatment periods where treatments are 
# grouped if gap between end of previous treatment continues to be <= 5 days
r |>
  group_by(ID) |>
  mutate(tmp = as.integer(Start_Date - lag(End_Date, default = NA)),
         tmp = if_else(is.na(tmp) | tmp > 5, 0 , tmp),
         ID1 = cumsum(tmp > 5 | tmp == 0)) |>
  ungroup() |>
  summarise(Start_Date = min(Start_Date),
            End_Date = max(End_Date), .by = c(ID, ID1)) |>
  select(-ID1)

# # A tibble: 5 × 3
#      ID Start_Date End_Date  
#   <int> <date>     <date>    
# 1     1 2018-05-06 2018-05-25
# 2     1 2019-06-28 2019-07-08
# 3     2 2018-04-20 2018-04-23
# 4     2 2018-07-20 2018-07-30
# 5     3 2017-05-14 2017-05-29

Upvotes: 1

The following code gives you the very first and very last date for each group, both keeping all rows (first block of code), or collapsing into groups (second block of code)

library(dplyr)

data |> 
  group_by(ID) |> 
  mutate(First_start = min(Start_Date),
         Last_end = max(End_Date))

# A tibble: 9 × 5
# Groups:   ID [3]
#      ID Start_Date End_Date First_start Last_end
#   <int> <chr>      <chr>    <chr>       <chr>   
# 1     1 05-06-18   05-10-18 05-06-18    07-08-19
# 2     1 05-08-18   05-14-18 05-06-18    07-08-19
# 3     1 05-16-18   05-25-18 05-06-18    07-08-19
# 4     1 06-28-19   07-02-19 05-06-18    07-08-19
# 5     1 07-03-19   07-08-19 05-06-18    07-08-19
# 6     2 04-20-18   04-23-18 04-20-18    07-30-18
# 7     2 07-20/18   07-25-18 04-20-18    07-30-18
# 8     2 07-26-18   07-30-18 04-20-18    07-30-18
# 9     3 05-14-17   05-29-17 05-14-17    05-29-17

data |> 
  group_by(ID) |> 
  summarize(First_start = min(Start_Date),
         Last_end = max(End_Date))

# A tibble: 3 × 3
#      ID First_start Last_end
#   <int> <chr>       <chr>   
# 1     1 05-06-18    07-08-19
# 2     2 04-20-18    07-30-18
# 3     3 05-14-17    05-29-17

Upvotes: 0

Related Questions