allen.joseph
allen.joseph

Reputation: 47

How to identify the cases where the start-to-end date range intersects with a "target" date range

Picture this dataset.

library(tidyverse)
library(lubridate)

df <- data.frame(
  ID = 1:10,
  start_date_smoking = c("28/06/2011", "27/07/2012", "09/04/2003", "01/07/2013", "28/06/2011",
                         "01/09/2013", "01/09/2013", "30/08/2014", "01/09/2013", "01/09/2013"),
  end_date_smoking = c("01/01/2017", "01/01/2017", "01/01/2017", "01/01/2017", "01/09/2013",
                       "01/09/2013", "02/09/2013", "01/01/2017", "31/08/2014", "31/08/2014")
)

df <- df %>% 
  mutate(
    start_date_smoking =dmy(start_date_smoking),
    end_date_smoking =  dmy(end_date_smoking),
    days = time_length(difftime(end_date_smoking, start_date_smoking),"days"))
  )

How can I capture smokers/cases (IDs) who were smoking between 01-09-2013 and 31-08-2014 AND smoked for at least 180 days. I tried filter() but I can't get it to work. For example, I want to capture someone who started before 01-09-2013 (e.g., ID 1) since they were smoking between those two dates. However, I don't want to capture someone who starts smoking 30-08-2014 (e.g., ID 8) even if they smoke for several years since that won't be smoking for 180 days within the 01-09-2013 and 31-08-2014 timeframe.

Ultimately, I want it to return a df filtered to IDs 1, 2, 3, 4, 9, and 10.

I tried doing but it didn't filter to the rows I was expecting

start_range <- dmy("01/09/2013")
end_range <- dmy("31/08/2014")

df %>%
  filter(days >= 180 &
         (start_date_smoking >= start_range & start_date_smoking <= end_range) |
           (end_date_smoking >= start_range & end_date_smoking <= end_range) |
           (start_date_smoking <= start_range & end_date_smoking >= end_range))

EDIT 1: As far as I am aware, this hasn't been answered previously in other posts (e.g., here or here)

The problem with just doing the last line (i.e., filter(start_date_smoking <= start_range & end_date_smoking >= end_range) is that the code isn't discerning with the cases it's filtering. For example, if someone's start date is 1985-01-01 and end_date is 2014-05-15. Technically, they would have been smoking for 180 days between "2013-09-01" and their end_date 2014-05-15 (i.e. 255 days) But equally, I don't want them if their start date is 1986-12-01 and end_date is 2014-02-25 since they were only smoking for 177 days between "2013-09-01" and their end_date 2014-02-25.

EDIT 2: Thank you so much Jake and Ben. They both worked though I have a large dataset (over 15 million rows), so Ben's solution was quicker. There was some minor discrepancies but they just depend on if the 'enddate' is included in the count.

Upvotes: 1

Views: 87

Answers (1)

Jake
Jake

Reputation: 117

Editing in response to the clarification provided in comments...

We need to identify the cases where the start-to-end date range intersects with the "target" date range for at least 180 days.

target.date.range <- seq(as.Date("2013-09-01"), as.Date("2014-08-31"), by = "days")

for (i in 1:nrow(df)){
  seq.days <- seq(df$start_date_smoking[i], df$end_date_smoking[i], by = "days")
  
  df$days.in.range[i] <- length(intersect(seq.days, target.date.range)) 
}

df %>% filter(days.in.range >= 180)

This returns:

 ID start_date_smoking end_date_smoking  days days.in.range
1  1         2011-06-28       2017-01-01  2014           365
2  2         2012-07-27       2017-01-01  1619           365
3  3         2003-04-09       2017-01-01  5016           365
4  4         2013-07-01       2017-01-01  1280           365
5  9         2013-09-01       2014-08-31   364           365
6 10         2013-09-01       2014-08-31   364           365

Upvotes: 1

Related Questions