Reputation: 75
I would like to create a new variable that indicates whether a visit_date is within any of the date ranges listed for an id
I've used this code to get row by row comparisons, but I want to expand this and compare all rows of an id to all rows of intervals listed for that id
df <- df %>%
group_by(id) %>%
mutate(between_any = ifelse((visit_date >= start & visit_date <= end), 1))
I have also tried creating an interval variable and using crossing(visit_date, interval) before mutating, however I could not get crossing to work for a date object.
Here is some example data:
df <- data.frame(id = c("a","a","a","a","a","b","b","b"),
visit_date = c("2001-08-22","2001-09-21","2001-10-30","2001-11-10","2001-12-20","2002-12-22", "2003-04-30","2003-05-10"),
start = c(NA,"2001-09-21",NA,"2001-11-10",NA,"2002-12-22", "2003-04-30",NA),
end = c(NA, "2001-11-01",NA,"2001-11-10",NA,"2002-12-22","2003-06-01",NA))
> df
id visit_date start end
a 2001-08-22 <NA> <NA>
a 2001-09-21 2001-09-21 2001-11-01
a 2001-10-30 <NA> <NA>
a 2001-11-10 2001-11-10 2001-11-10
a 2001-12-20 <NA> <NA>
b 2002-12-22 2002-12-22 2002-12-22
b 2003-04-30 2003-04-30 2003-06-01
b 2003-05-10 <NA> <NA>
My desired output is the following:
id visit_date start end between_any
a 2001-08-22 <NA> <NA> 0
a 2001-09-21 2001-09-21 2001-11-01 1
a 2001-10-30 <NA> <NA> 1
a 2001-11-10 2001-11-10 2001-11-10 1
a 2001-12-20 <NA> <NA> 0
b 2002-12-22 2002-12-22 2002-12-22 1
b 2003-04-30 2003-04-30 2003-06-01 1
b 2003-05-10 <NA> <NA> 1
Thanks in advance!
Upvotes: 2
Views: 853
Reputation: 39858
A different possibility could be:
df %>%
rowid_to_column() %>%
full_join(df %>%
filter(!is.na(start) & !is.na(end)) %>%
mutate(interval = interval(ymd(start), ymd(end))) %>%
select(id, interval), by = c("id" = "id")) %>%
group_by(rowid, id) %>%
summarise(between_any = max(ymd(visit_date) %within% interval * 1)) %>%
left_join(df %>%
rowid_to_column(), by = c("rowid" = "rowid",
"id" = "id")) %>%
ungroup() %>%
select(-rowid)
id between_any visit_date start end
<fct> <dbl> <fct> <fct> <fct>
1 a 0 2001-11-08 <NA> <NA>
2 a 1 2001-09-21 2001-09-21 2001-11-01
3 a 1 2001-10-30 <NA> <NA>
4 a 1 2001-11-10 2001-11-10 2001-11-10
5 a 0 2001-12-20 <NA> <NA>
6 b 1 2002-12-22 2002-12-22 2002-12-22
7 b 1 2003-04-30 2003-04-30 2003-06-01
8 b 1 2003-05-10 <NA> <NA>
Here it, first, creates the interval variable and then performs a full join based on "id". Second, it checks whether "visit_date" is within any of the intervals per "id" and "rowid". Finally, it joins the results with the original data.
Upvotes: 0
Reputation: 2636
in_range
function in the data.table
package does exactly this...
library(data.table)
df <- df %>%
group_by(id) %>%
mutate(between_any = as.numeric((inrange(visit_date, start, end))))
#> df
# id visit_date start end between_any
#1 a 2001-08-22 <NA> <NA> 0
#2 a 2001-09-21 2001-09-21 2001-11-01 1
#3 a 2001-10-30 <NA> <NA> 1
#4 a 2001-11-10 2001-11-10 2001-11-10 1
#5 a 2001-12-20 <NA> <NA> 0
#6 b 2002-12-22 2002-12-22 2002-12-22 1
#7 b 2003-04-30 2003-04-30 2003-06-01 1
#8 b 2003-05-10 <NA> <NA> 1
In data.table form...
dt <- setDT(df)
dt[, between_any := inrange(visit_date, start, end),
by = id]
Upvotes: 3
Reputation: 3178
My answer isn't as "pretty" as I would like, but it gets you where you are looking to go.
I first convert your dates to dates:
library(lubridate)
library(dplyr)
library(tibble)
library(tidyr)
library(purrr)
df <- data.frame(id = c("a","a","a","a","a","b","b","b"),
visit_date = c("2001-08-22","2001-09-21","2001-10-30","2001-11-10","2001-12-20","2002-12-22", "2003-04-30","2003-05-10"),
start = c(NA,"2001-09-21",NA,"2001-11-10",NA,"2002-12-22", "2003-04-30",NA),
end = c(NA, "2001-11-01",NA,"2001-11-10",NA,"2002-12-22","2003-06-01",NA)) %>%
mutate_at(-1,as.Date)
> df
id visit_date start end
1 a 2001-08-22 <NA> <NA>
2 a 2001-09-21 2001-09-21 2001-11-01
3 a 2001-10-30 <NA> <NA>
4 a 2001-11-10 2001-11-10 2001-11-10
5 a 2001-12-20 <NA> <NA>
6 b 2002-12-22 2002-12-22 2002-12-22
7 b 2003-04-30 2003-04-30 2003-06-01
8 b 2003-05-10 <NA> <NA>
Next I create a list of intervals for each group:
df_intervals <- df %>%
mutate_at(-1,as.Date) %>%
filter(!is.na(start),
!is.na(end)) %>%
mutate(interval = start %--% end) %>%
select(id,interval) %>%
group_by(id)
> df_intervals
# A tibble: 4 x 2
# Groups: id [2]
id interval
<fct> <S4: Interval>
1 a 2001-09-21 UTC--2001-11-01 UTC
2 a 2001-11-10 UTC--2001-11-10 UTC
3 b 2002-12-22 UTC--2002-12-22 UTC
4 b 2003-04-30 UTC--2003-06-01 UTC
Lastly, I join the intervals data to the original data based on id
and search for the visit_date
inside the intervals.
df_output <- df %>% as.tbl() %>%
inner_join(df_intervals) %>%
mutate(between_any = map2_lgl(visit_date,interval,~ .x >= int_start(.y) & .x <= int_end(.y))) %>%
group_by(id,visit_date,start,end) %>%
summarise(between_any = as.numeric(any(between_any)))
> df_output
# A tibble: 8 x 5
# Groups: id, visit_date, start [8]
id visit_date start end between_any
<fct> <date> <date> <date> <dbl>
1 a 2001-08-22 NA NA 0
2 a 2001-09-21 2001-09-21 2001-11-01 1
3 a 2001-10-30 NA NA 1
4 a 2001-11-10 2001-11-10 2001-11-10 1
5 a 2001-12-20 NA NA 0
6 b 2002-12-22 2002-12-22 2002-12-22 1
7 b 2003-04-30 2003-04-30 2003-06-01 1
8 b 2003-05-10 NA NA 1
Upvotes: 2