Reputation: 47
Picture this dataset.
library(tidyverse)
library(lubridate)
df <- data.frame(
ID = 1:10,
start_date_smoking = c("28/06/2011", "27/07/2012", "09/04/2003", "01/07/2013", "28/06/2011",
"01/09/2013", "01/09/2013", "30/08/2014", "01/09/2013", "01/09/2013"),
end_date_smoking = c("01/01/2017", "01/01/2017", "01/01/2017", "01/01/2017", "01/09/2013",
"01/09/2013", "02/09/2013", "01/01/2017", "31/08/2014", "31/08/2014")
)
df <- df %>%
mutate(
start_date_smoking =dmy(start_date_smoking),
end_date_smoking = dmy(end_date_smoking),
days = time_length(difftime(end_date_smoking, start_date_smoking),"days"))
)
How can I capture smokers/cases (IDs) who were smoking between 01-09-2013 and 31-08-2014 AND smoked for at least 180 days. I tried filter() but I can't get it to work. For example, I want to capture someone who started before 01-09-2013 (e.g., ID 1) since they were smoking between those two dates. However, I don't want to capture someone who starts smoking 30-08-2014 (e.g., ID 8) even if they smoke for several years since that won't be smoking for 180 days within the 01-09-2013 and 31-08-2014 timeframe.
Ultimately, I want it to return a df filtered to IDs 1, 2, 3, 4, 9, and 10.
I tried doing but it didn't filter to the rows I was expecting
start_range <- dmy("01/09/2013")
end_range <- dmy("31/08/2014")
df %>%
filter(days >= 180 &
(start_date_smoking >= start_range & start_date_smoking <= end_range) |
(end_date_smoking >= start_range & end_date_smoking <= end_range) |
(start_date_smoking <= start_range & end_date_smoking >= end_range))
EDIT 1: As far as I am aware, this hasn't been answered previously in other posts (e.g., here or here)
The problem with just doing the last line (i.e., filter(start_date_smoking <= start_range & end_date_smoking >= end_range) is that the code isn't discerning with the cases it's filtering. For example, if someone's start date is 1985-01-01 and end_date is 2014-05-15. Technically, they would have been smoking for 180 days between "2013-09-01" and their end_date 2014-05-15 (i.e. 255 days) But equally, I don't want them if their start date is 1986-12-01 and end_date is 2014-02-25 since they were only smoking for 177 days between "2013-09-01" and their end_date 2014-02-25.
EDIT 2: Thank you so much Jake and Ben. They both worked though I have a large dataset (over 15 million rows), so Ben's solution was quicker. There was some minor discrepancies but they just depend on if the 'enddate' is included in the count.
Upvotes: 1
Views: 87
Reputation: 117
Editing in response to the clarification provided in comments...
We need to identify the cases where the start-to-end date range intersects with the "target" date range for at least 180 days.
target.date.range <- seq(as.Date("2013-09-01"), as.Date("2014-08-31"), by = "days")
for (i in 1:nrow(df)){
seq.days <- seq(df$start_date_smoking[i], df$end_date_smoking[i], by = "days")
df$days.in.range[i] <- length(intersect(seq.days, target.date.range))
}
df %>% filter(days.in.range >= 180)
This returns:
ID start_date_smoking end_date_smoking days days.in.range
1 1 2011-06-28 2017-01-01 2014 365
2 2 2012-07-27 2017-01-01 1619 365
3 3 2003-04-09 2017-01-01 5016 365
4 4 2013-07-01 2017-01-01 1280 365
5 9 2013-09-01 2014-08-31 364 365
6 10 2013-09-01 2014-08-31 364 365
Upvotes: 1