Reputation: 11
I am dealing with individuals who received a particular treatment and I want to know the time ranges they were treated, or within 5 days of ending a treatment and starting another.
r <- read.table(text="
ID Start_Date End_Date
1 05-06-18 05-10-18
1 05-08-18 05-14-18
1 05-16-18 05-25-18
1 06-28-19 07-02-19
1 07-03-19 07-08-19
2 04-20-18 04-23-18
2 07-20/18 07-25-18
2 07-26-18 07-30-18
3 05-14-17 05-29-17",
stringsAsFactors=FALSE, header=TRUE)
This is what I had, and this is what I want:
r <- read.table(text="
ID Start_Date End_Date
1 05-06-18 05-25-18
1 06-28-19 07-08-19
2 04-20-18 07-30-18
3 05-14-17 05-29-17",
stringsAsFactors=FALSE, header=TRUE)
This is the code I have tried, but doesn't quite get what I am after.
Upvotes: 0
Views: 37
Reputation: 6885
Based on your data, I cannot determine how to achieve your example output based on your description. That's because the dates for ID == 2 have a gap > 5 days.
However, applying the logic that creates your desired outcome for the other ID values, I have assumed that a treatment period is grouped if the gap between the end of previous treatment continues to be <= 5 days. For example, if a sequence of five treatments are 1, 3, 5, 4, and 5 days apart, this 18 day period should be grouped as one row.
The workflow:
The result returns two rows for ID == 2, which I believe is consistent with your explanation. If not, please update your question.
library(lubridate)
library(dplyr)
r <- read.table(text = "
ID Start_Date End_Date
1 05-06-18 05-10-18
1 05-08-18 05-14-18
1 05-16-18 05-25-18
1 06-28-19 07-02-19
1 07-03-19 07-08-19
2 04-20-18 04-23-18
2 07-20-18 07-25-18
2 07-26-18 07-30-18
3 05-14-17 05-29-17",
stringsAsFactors = FALSE, header = TRUE)
# Convert date strings to date class columns
r$Start_Date <- mdy(r$Start_Date)
r$End_Date <- mdy(r$End_Date)
# Determine start and end dates for treatment periods where treatments are
# grouped if gap between end of previous treatment continues to be <= 5 days
r |>
group_by(ID) |>
mutate(tmp = as.integer(Start_Date - lag(End_Date, default = NA)),
tmp = if_else(is.na(tmp) | tmp > 5, 0 , tmp),
ID1 = cumsum(tmp > 5 | tmp == 0)) |>
ungroup() |>
summarise(Start_Date = min(Start_Date),
End_Date = max(End_Date), .by = c(ID, ID1)) |>
select(-ID1)
# # A tibble: 5 × 3
# ID Start_Date End_Date
# <int> <date> <date>
# 1 1 2018-05-06 2018-05-25
# 2 1 2019-06-28 2019-07-08
# 3 2 2018-04-20 2018-04-23
# 4 2 2018-07-20 2018-07-30
# 5 3 2017-05-14 2017-05-29
Upvotes: 1
Reputation: 1008
The following code gives you the very first and very last date for each group, both keeping all rows (first block of code), or collapsing into groups (second block of code)
library(dplyr)
data |>
group_by(ID) |>
mutate(First_start = min(Start_Date),
Last_end = max(End_Date))
# A tibble: 9 × 5
# Groups: ID [3]
# ID Start_Date End_Date First_start Last_end
# <int> <chr> <chr> <chr> <chr>
# 1 1 05-06-18 05-10-18 05-06-18 07-08-19
# 2 1 05-08-18 05-14-18 05-06-18 07-08-19
# 3 1 05-16-18 05-25-18 05-06-18 07-08-19
# 4 1 06-28-19 07-02-19 05-06-18 07-08-19
# 5 1 07-03-19 07-08-19 05-06-18 07-08-19
# 6 2 04-20-18 04-23-18 04-20-18 07-30-18
# 7 2 07-20/18 07-25-18 04-20-18 07-30-18
# 8 2 07-26-18 07-30-18 04-20-18 07-30-18
# 9 3 05-14-17 05-29-17 05-14-17 05-29-17
data |>
group_by(ID) |>
summarize(First_start = min(Start_Date),
Last_end = max(End_Date))
# A tibble: 3 × 3
# ID First_start Last_end
# <int> <chr> <chr>
# 1 1 05-06-18 07-08-19
# 2 2 04-20-18 07-30-18
# 3 3 05-14-17 05-29-17
Upvotes: 0