In R, how to find other rows in the same group_by that belong to a date range defined by row?

In R I have this object:

z <- 
  dplyr::tribble(
    ~event_id, ~group_id, ~date_event, ~date_min,   ~date_max,  
    1, 1, "2019-11-11", "2019-11-04", "2019-11-18",
    2, 1, "2019-11-13", "2019-11-06", "2019-11-20",
    3, 1, "2019-11-19", "2019-11-12", "2019-11-26",
    4, 1, "2020-04-30", "2020-04-23", "2020-05-07",
    5, 2, "2019-11-05",  "2019-10-29", "2019-11-12",
    6, 2, "2019-11-26", "2019-11-19", "2019-12-03"
    ) %>%
  dplyr::mutate_if(is.character,lubridate::as_date)

My data follows this structure: an event_id for each row; group_id for each group; a date_event; and a range that is between 7 days before and 7 days after the date_event (date_min and date_max).

What I want to know is: for each row (each event), what are the other events inside the same group_id that match the range (based on date_min and date_max) for that row?!

An output example:

I'm not sure of what output format exactly I want, but I need this result, and I cannot figure out how to solve it.

Someone could help? Thanks in advance. Best, Wlademir.

Upvotes: 2

Views: 72

Answers (3)

akrun
akrun

Reputation: 887851

We could use a non-equi join in data.table

library(data.table)
setDT(z)[z, .N, on = .(group_id, date_min <= date_event,
         date_max >= date_event), by = .EACHI]
#    group_id   date_min   date_max N
#1:        1 2019-11-11 2019-11-11 2
#2:        1 2019-11-13 2019-11-13 3
#3:        1 2019-11-19 2019-11-19 2
#4:        1 2020-04-30 2020-04-30 1
#5:        2 2019-11-05 2019-11-05 1
#6:        2 2019-11-26 2019-11-26 1

and to return the 'dates' and the row index

setDT(z)[z, .(.N, dates = .(date_event), rn = .(.I)), on = 
    .(group_id, date_min <= date_event,
        date_max >= date_event), by = .EACHI]

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389235

You can use regex join to join the data with itself and calculate the dates which are in range.

data <- fuzzyjoin::fuzzy_inner_join(z, z, 
       by = c('group_id', 'date_min' = 'date_event', 'date_max' = 'date_event'), 
       match_fun = c(`==`, `<=`, `>=`))

Now to count how many id's overlap for each id you can use :

library(dplyr)
data %>% count(event_id.x, group_id.x)

# A tibble: 6 x 3
#  event_id.x group_id.x     n
#       <dbl>      <dbl> <int>
#1          1          1     2
#2          2          1     3
#3          3          1     2
#4          4          1     1
#5          5          2     1
#6          6          2     1

and to get the id's which overlap you can use :

data %>%
  group_by(event_id.x) %>%
  summarise(other_ids = list(event_id.y))

#  event_id.x other_ids
#       <dbl> <list>   
#1          1 <dbl [2]>
#2          2 <dbl [3]>
#3          3 <dbl [2]>
#4          4 <dbl [1]>
#5          5 <dbl [1]>
#6          6 <dbl [1]>

Note that each id also overlaps with itself so if you need to remove such id's from data by doing filter(event_id.x != event_id.y).

Upvotes: 1

alex_jwb90
alex_jwb90

Reputation: 1723

You can first group, then use purrr::map2 for this. For that to work, you pass in those date thresholds to map over simultaneously and the event id & date vectors as an additional argument:

library(dplyr)
library(purrr)

z2 <- z %>%
  group_by(group_id) %>%
  mutate(
    event_ids_in_window = map2(
      date_min, date_max,
      event_ids = event_id, dates = date_event,
      .f = function(date_min, date_max, event_ids, dates) {
        event_ids[which(dates >= date_min & dates <= date_max)]
      }
    )
  )

Upvotes: 1

Related Questions