Reputation: 607
In R I have this object:
z <-
dplyr::tribble(
~event_id, ~group_id, ~date_event, ~date_min, ~date_max,
1, 1, "2019-11-11", "2019-11-04", "2019-11-18",
2, 1, "2019-11-13", "2019-11-06", "2019-11-20",
3, 1, "2019-11-19", "2019-11-12", "2019-11-26",
4, 1, "2020-04-30", "2020-04-23", "2020-05-07",
5, 2, "2019-11-05", "2019-10-29", "2019-11-12",
6, 2, "2019-11-26", "2019-11-19", "2019-12-03"
) %>%
dplyr::mutate_if(is.character,lubridate::as_date)
My data follows this structure: an event_id for each row; group_id for each group; a date_event; and a range that is between 7 days before and 7 days after the date_event (date_min and date_max).
What I want to know is: for each row (each event), what are the other events inside the same group_id that match the range (based on date_min and date_max) for that row?!
An output example:
I'm not sure of what output format exactly I want, but I need this result, and I cannot figure out how to solve it.
Someone could help? Thanks in advance. Best, Wlademir.
Upvotes: 2
Views: 72
Reputation: 887851
We could use a non-equi join in data.table
library(data.table)
setDT(z)[z, .N, on = .(group_id, date_min <= date_event,
date_max >= date_event), by = .EACHI]
# group_id date_min date_max N
#1: 1 2019-11-11 2019-11-11 2
#2: 1 2019-11-13 2019-11-13 3
#3: 1 2019-11-19 2019-11-19 2
#4: 1 2020-04-30 2020-04-30 1
#5: 2 2019-11-05 2019-11-05 1
#6: 2 2019-11-26 2019-11-26 1
and to return the 'dates' and the row index
setDT(z)[z, .(.N, dates = .(date_event), rn = .(.I)), on =
.(group_id, date_min <= date_event,
date_max >= date_event), by = .EACHI]
Upvotes: 1
Reputation: 389235
You can use regex join to join the data with itself and calculate the dates which are in range.
data <- fuzzyjoin::fuzzy_inner_join(z, z,
by = c('group_id', 'date_min' = 'date_event', 'date_max' = 'date_event'),
match_fun = c(`==`, `<=`, `>=`))
Now to count how many id's overlap for each id you can use :
library(dplyr)
data %>% count(event_id.x, group_id.x)
# A tibble: 6 x 3
# event_id.x group_id.x n
# <dbl> <dbl> <int>
#1 1 1 2
#2 2 1 3
#3 3 1 2
#4 4 1 1
#5 5 2 1
#6 6 2 1
and to get the id's which overlap you can use :
data %>%
group_by(event_id.x) %>%
summarise(other_ids = list(event_id.y))
# event_id.x other_ids
# <dbl> <list>
#1 1 <dbl [2]>
#2 2 <dbl [3]>
#3 3 <dbl [2]>
#4 4 <dbl [1]>
#5 5 <dbl [1]>
#6 6 <dbl [1]>
Note that each id also overlaps with itself so if you need to remove such id's from data by doing filter(event_id.x != event_id.y)
.
Upvotes: 1
Reputation: 1723
You can first group, then use purrr::map2
for this. For that to work, you pass in those date thresholds to map over simultaneously and the event id & date vectors as an additional argument:
library(dplyr)
library(purrr)
z2 <- z %>%
group_by(group_id) %>%
mutate(
event_ids_in_window = map2(
date_min, date_max,
event_ids = event_id, dates = date_event,
.f = function(date_min, date_max, event_ids, dates) {
event_ids[which(dates >= date_min & dates <= date_max)]
}
)
)
Upvotes: 1