Reputation: 1643

Find n overlapping dates within n number of days

I'm looking to find >=4 unique events that all occur within a group within a 90 day period and then flag the ID.

Just a test example:

library(dplyr)

set.seed(1)

test <- data.frame(
  PATID = sample(1:1e4, 1e5, replace = TRUE),
  PROV = sample(1:50, 1e5, replace = TRUE),
  GROUP = sample(0:1, 1e5, replace = TRUE),
  DATE = as.Date(sample(
    as.Date("2020-01-01"):as.Date("2020-12-31"),
    1e5,
    replace = TRUE
  ), origin = "1970-01-01")
)

If we look at PATID==5 we can see there are 4 unique PROVs with overlapping dates within 90 days and within our group of interest and so should be flagged.

> test %>% filter(PATID==5) %>% arrange(GROUP,DATE)
   PATID PROV GROUP       DATE
1      5    2     0 2020-05-07
2      5    3     0 2020-05-20
3      5    3     0 2020-11-15
4      5   49     0 2020-12-14
5      5   45     1 2020-02-16
6      5   50     1 2020-03-19
7      5   38     1 2020-03-25
8      5   27     1 2020-03-29
9      5   42     1 2020-08-30
10     5   46     1 2020-11-03
11     5   25     1 2020-11-13
12     5   29     1 2020-12-26
> as.Date("2020-03-29")-as.Date("2020-02-16")<=90
[1] TRUE

Ultimately, I'm looking for the proportion of GROUP==1 vs GROUP==0 with >=4 unique PROVs within 90 days. Ideally I'd prefer using data.table simply due to the scale of data.

Trying out some code:

test %>%
  filter(PATID %in% 1:5) %>%
  group_by(PATID,GROUP) %>%
  arrange(GROUP, DATE) %>%
  mutate(lag = DATE - lag(DATE),
         day_count = case_when(lag <= 90 ~ TRUE,
                               is.na(lag) ~ TRUE,
                               TRUE ~ FALSE)) %>%
  mutate(crit = cumsum_reset(day_count)) %>% 
  ungroup() %>%
  group_by(PATID) %>%
  mutate(flag = case_when(max(crit) >= 4 ~ 1,
                          TRUE ~ 0)) %>% 
  arrange(PATID)

Getting closer, just need to sort out the 90 window versus just crudely testing if each date is within 90 days.

Upvotes: 1

Answers (3)

Marcelo Avila

Reputation: 2374

Maybe the following is what you are after. Please check if the logic is what you meant. I left more explicit than necessary so that the idea can be more easily understood. The main idea is that if after sorting there is a observation from same PATDI & GROUP that is within 90 days from the 3rd lag diff_3 := DATE - shift(DATE, 3), than it should be flagged. This is done by checking diff_check = diff_3<=90. If any observation for any PATID/GROUP is flagged, the corresponding ID will be flagged by the keep = max(diff_check, na.rm = TRUE, pmin = 0) after grouping by only PATID.

Using the third lag to account for 4 or more and not strictly more than 4 observations.
Does it, all in all, make any sense?

library(data.table)
set.seed(1)
test <- data.frame(
  PATID = sample(1:1e4, 1e5, replace = TRUE),
  PROV = sample(1:50, 1e5, replace = TRUE),
  GROUP = sample(0:1, 1e5, replace = TRUE),
  DATE = as.Date(sample(
    as.Date("2020-01-01"):as.Date("2020-12-31"),
    1e5,
    replace = TRUE
  ), origin = "1970-01-01")
)
test %>% filter(PATID==5) %>% arrange(GROUP,DATE)
#> Error in test %>% filter(PATID == 5) %>% arrange(GROUP, DATE): could not find function "%>%"

dt <- as.data.table(test)
dt <- dt[order(PATID, GROUP, DATE)]
dt[, diff_3 := DATE - shift(DATE, 3), by = c("PATID", "GROUP")]

# check amount of unique values of PROV in previous 4 observations
dt[, unique_last_4 := frollapply(x = PROV, n = 4, FUN = uniqueN), by = c("PATID", "GROUP")]

# check if within 90 days and unique PROVs 
dt[, diff_check := diff_3<=90 & unique_last_4==4, by = c("PATID", "GROUP")]

# final check to flag all observations of ID that satisfied at least once the above checks
dt[, to_keep := max(diff_check, na.rm = TRUE, pmin = 0), by = "PATID"]
# NOTE: unsure if you mean to group only by PATID here or by PATID & GROUP.

head(dt[to_keep==1], 20)
#>     PATID PROV GROUP       DATE   diff_3 unique_last_4 diff_check to_keep
#>  1:     5    2     0 2020-05-07  NA days            NA         NA       1
#>  2:     5    3     0 2020-05-20  NA days            NA         NA       1
#>  3:     5    3     0 2020-11-15  NA days            NA         NA       1
#>  4:     5   49     0 2020-12-14 221 days             3      FALSE       1
#>  5:     5   45     1 2020-02-16  NA days            NA         NA       1
#>  6:     5   50     1 2020-03-19  NA days            NA         NA       1
#>  7:     5   38     1 2020-03-25  NA days            NA         NA       1
#>  8:     5   27     1 2020-03-29  42 days             4       TRUE       1
#>  9:     5   42     1 2020-08-30 164 days             4      FALSE       1
#> 10:     5   46     1 2020-11-03 223 days             4      FALSE       1
#> 11:     5   25     1 2020-11-13 229 days             4      FALSE       1
#> 12:     5   29     1 2020-12-26 118 days             4      FALSE       1
#> 13:     7    1     0 2020-04-10  NA days            NA         NA       1
#> 14:     7   44     0 2020-04-29  NA days            NA         NA       1
#> 15:     7   27     0 2020-05-05  NA days            NA         NA       1
#> 16:     7   41     0 2020-06-11  62 days             4       TRUE       1
#> 17:     7   35     0 2020-06-30  62 days             4       TRUE       1
#> 18:     7   11     0 2020-12-18 227 days             4      FALSE       1
#> 19:     7   24     1 2020-12-24  NA days            NA         NA       1
#> 20:     7   13     1 2020-12-29  NA days            NA         NA       1

^{Created on 2021-06-22 by the reprex package (v2.0.0)}

dplyr version

test_keep <- test %>% arrange(PATID, GROUP, DATE) %>%
  head(1000) %>% # otherwise it takes too long in my pc, which shows data.table's efficiency! 
  group_by(PATID, GROUP) %>%
  mutate(diff_3 = DATE - lag(DATE, 3),
         diff_check = diff_3<=90, 
         unique_last_4 = frollapply(x = PROV, n = 4, FUN = uniqueN)
  ) %>% group_by(PATID) %>%
  mutate(keep = max(diff_check, na.rm = TRUE, pmin = 0)) %>%
  arrange(PATID, GROUP)

test_keep %>% filter(keep==1) %>% head(20)

Upvotes: 1

Ian Gow

Reputation: 3535

There's some ambiguities in the question, so this may not be quite right. I tried doing this using dplyr and local data frames, but the self-join causes an overflow (100,000 times 100,000). It seems to work using data.table and using PostgreSQL, which has an OVERLAPS function. (Note that I used lower-case variable names to make working with SQL easier.)

In the answer below, I start with a patient visit ((patid, prov, group, date) combination) and look forward 90 days to capture all visits by that patient (patid) to other providers (prov != prov_other). I then count the number of distinct providers in that lookahead period (this will be NA when there are no visits, as when looking at a patient's last visit in the sample). I then count the number of visits where the number of additional distinct providers in the subsequent 90 days is 3 or more.

Finally, I group by (group, year) and count the proportion of visits that are followed by visits to at least three other providers during the subsequent 90 days. Given the way the data are generated, it is no surprise that the two groups look similar on this metric.

Note that each patient visit forms a unit of observation here. In practice, it may make sense to aggregate by (say) (patid, year) before calculating statistics or do some other kind of aggregation.

library(data.table)
library(dplyr, warn.conflicts = FALSE)

set.seed(1)

test <- tibble(
    patid = sample(1:1e4, 1e5, replace = TRUE),
    prov = sample(1:50, 1e5, replace = TRUE),
    group = sample(0:1, 1e5, replace = TRUE),
    date = as.Date(sample(
        as.Date("2020-01-01"):as.Date("2020-12-31"),
        1e5,
        replace = TRUE
    ), origin = "1970-01-01")) %>%
    as.data.table()
test 
#>         patid prov group       date
#>      1:  1017    6     1 2020-08-03
#>      2:  8004   34     0 2020-12-15
#>      3:  4775   32     0 2020-06-21
#>      4:  9725   47     1 2020-09-25
#>      5:  8462   15     0 2020-03-05
#>     ---                            
#>  99996:   949   47     0 2020-07-05
#>  99997:  2723   37     0 2020-08-18
#>  99998:   201   27     1 2020-01-06
#>  99999:   163    9     0 2020-03-06
#> 100000:  3204   48     1 2020-11-17

df_overlap <- 
    test %>% 
    inner_join(test, by = "patid", suffix = c("", "_other")) %>%
    filter(prov != prov_other) %>%
    filter(date_other >= date & date_other <= date + 90L)

mt_4_provs_df <-
    df_overlap %>% 
    group_by(patid, prov, group, date) %>%
    summarize(n_providers = n_distinct(prov_other), .groups = "drop")

results <- 
    test %>%
    left_join(mt_4_provs_df, by = c("patid", "prov", "group", "date")) %>%
    mutate(mt_4_provs = n_providers >= 3,
           year = year(date)) %>%
    group_by(group, year) %>%
    summarize(prop_mt_4_provs = mean(mt_4_provs, na.rm = TRUE),
              .groups = "drop")
    
results
#> # A tibble: 2 x 3
#>   group  year prop_mt_4_provs
#>   <int> <int>           <dbl>
#> 1     0  2020           0.426
#> 2     1  2020           0.423

^{Created on 2021-06-22 by the reprex package (v2.0.0)}

Upvotes: 0

chinsoon12

Reputation: 25223

Based on I'm looking for the annual "group" proportion of patients that visit >=4 providers within 90 days, you can try this:

library(data.table) #data.table 1.13.2
setDT(test)[, c("d90ago", "d90aft") := .(DATE - 90L, DATE + 90L)]
setkey(test, PATID, DATE)
test[, grp := 
    .SD[.SD, on=.(PATID, DATE>=d90ago, DATE<=d90aft), by=.EACHI, +(length(unique(x.PROV))>=4L)]$V1
]

The above allows PROV within overlapping windows of 90 days to be re-used.

Upvotes: 1

Find n overlapping dates within n number of days

Answers (3)

dplyr version

Related Questions