Reputation: 1643
I'm looking to find >=4 unique events that all occur within a group within a 90 day period and then flag the ID.
Just a test example:
library(dplyr)
set.seed(1)
test <- data.frame(
PATID = sample(1:1e4, 1e5, replace = TRUE),
PROV = sample(1:50, 1e5, replace = TRUE),
GROUP = sample(0:1, 1e5, replace = TRUE),
DATE = as.Date(sample(
as.Date("2020-01-01"):as.Date("2020-12-31"),
1e5,
replace = TRUE
), origin = "1970-01-01")
)
If we look at PATID==5
we can see there are 4 unique PROV
s with overlapping dates within 90 days and within our group of interest and so should be flagged.
> test %>% filter(PATID==5) %>% arrange(GROUP,DATE)
PATID PROV GROUP DATE
1 5 2 0 2020-05-07
2 5 3 0 2020-05-20
3 5 3 0 2020-11-15
4 5 49 0 2020-12-14
5 5 45 1 2020-02-16
6 5 50 1 2020-03-19
7 5 38 1 2020-03-25
8 5 27 1 2020-03-29
9 5 42 1 2020-08-30
10 5 46 1 2020-11-03
11 5 25 1 2020-11-13
12 5 29 1 2020-12-26
> as.Date("2020-03-29")-as.Date("2020-02-16")<=90
[1] TRUE
Ultimately, I'm looking for the proportion of GROUP==1
vs GROUP==0
with >=4 unique PROVs
within 90 days. Ideally I'd prefer using data.table simply due to the scale of data.
Trying out some code:
test %>%
filter(PATID %in% 1:5) %>%
group_by(PATID,GROUP) %>%
arrange(GROUP, DATE) %>%
mutate(lag = DATE - lag(DATE),
day_count = case_when(lag <= 90 ~ TRUE,
is.na(lag) ~ TRUE,
TRUE ~ FALSE)) %>%
mutate(crit = cumsum_reset(day_count)) %>%
ungroup() %>%
group_by(PATID) %>%
mutate(flag = case_when(max(crit) >= 4 ~ 1,
TRUE ~ 0)) %>%
arrange(PATID)
Getting closer, just need to sort out the 90 window versus just crudely testing if each date is within 90 days.
Upvotes: 1
Views: 176
Reputation: 2374
Maybe the following is what you are after. Please check if the logic is what you meant. I left more explicit than necessary so that the idea can be more easily understood. The main idea is that if after sorting there is a observation from same PATDI & GROUP that is within 90 days from the 3rd lag diff_3 := DATE - shift(DATE, 3)
, than it should be flagged. This is done by checking diff_check = diff_3<=90
. If any observation for any PATID/GROUP is flagged, the corresponding ID will be flagged by the keep = max(diff_check, na.rm = TRUE, pmin = 0)
after grouping by only PATID.
library(data.table)
set.seed(1)
test <- data.frame(
PATID = sample(1:1e4, 1e5, replace = TRUE),
PROV = sample(1:50, 1e5, replace = TRUE),
GROUP = sample(0:1, 1e5, replace = TRUE),
DATE = as.Date(sample(
as.Date("2020-01-01"):as.Date("2020-12-31"),
1e5,
replace = TRUE
), origin = "1970-01-01")
)
test %>% filter(PATID==5) %>% arrange(GROUP,DATE)
#> Error in test %>% filter(PATID == 5) %>% arrange(GROUP, DATE): could not find function "%>%"
dt <- as.data.table(test)
dt <- dt[order(PATID, GROUP, DATE)]
dt[, diff_3 := DATE - shift(DATE, 3), by = c("PATID", "GROUP")]
# check amount of unique values of PROV in previous 4 observations
dt[, unique_last_4 := frollapply(x = PROV, n = 4, FUN = uniqueN), by = c("PATID", "GROUP")]
# check if within 90 days and unique PROVs
dt[, diff_check := diff_3<=90 & unique_last_4==4, by = c("PATID", "GROUP")]
# final check to flag all observations of ID that satisfied at least once the above checks
dt[, to_keep := max(diff_check, na.rm = TRUE, pmin = 0), by = "PATID"]
# NOTE: unsure if you mean to group only by PATID here or by PATID & GROUP.
head(dt[to_keep==1], 20)
#> PATID PROV GROUP DATE diff_3 unique_last_4 diff_check to_keep
#> 1: 5 2 0 2020-05-07 NA days NA NA 1
#> 2: 5 3 0 2020-05-20 NA days NA NA 1
#> 3: 5 3 0 2020-11-15 NA days NA NA 1
#> 4: 5 49 0 2020-12-14 221 days 3 FALSE 1
#> 5: 5 45 1 2020-02-16 NA days NA NA 1
#> 6: 5 50 1 2020-03-19 NA days NA NA 1
#> 7: 5 38 1 2020-03-25 NA days NA NA 1
#> 8: 5 27 1 2020-03-29 42 days 4 TRUE 1
#> 9: 5 42 1 2020-08-30 164 days 4 FALSE 1
#> 10: 5 46 1 2020-11-03 223 days 4 FALSE 1
#> 11: 5 25 1 2020-11-13 229 days 4 FALSE 1
#> 12: 5 29 1 2020-12-26 118 days 4 FALSE 1
#> 13: 7 1 0 2020-04-10 NA days NA NA 1
#> 14: 7 44 0 2020-04-29 NA days NA NA 1
#> 15: 7 27 0 2020-05-05 NA days NA NA 1
#> 16: 7 41 0 2020-06-11 62 days 4 TRUE 1
#> 17: 7 35 0 2020-06-30 62 days 4 TRUE 1
#> 18: 7 11 0 2020-12-18 227 days 4 FALSE 1
#> 19: 7 24 1 2020-12-24 NA days NA NA 1
#> 20: 7 13 1 2020-12-29 NA days NA NA 1
Created on 2021-06-22 by the reprex package (v2.0.0)
test_keep <- test %>% arrange(PATID, GROUP, DATE) %>%
head(1000) %>% # otherwise it takes too long in my pc, which shows data.table's efficiency!
group_by(PATID, GROUP) %>%
mutate(diff_3 = DATE - lag(DATE, 3),
diff_check = diff_3<=90,
unique_last_4 = frollapply(x = PROV, n = 4, FUN = uniqueN)
) %>% group_by(PATID) %>%
mutate(keep = max(diff_check, na.rm = TRUE, pmin = 0)) %>%
arrange(PATID, GROUP)
test_keep %>% filter(keep==1) %>% head(20)
Upvotes: 1
Reputation: 3535
There's some ambiguities in the question, so this may not be quite right. I tried doing this using dplyr
and local data frames, but the self-join causes an overflow (100,000 times 100,000).
It seems to work using data.table
and using PostgreSQL, which has an OVERLAPS
function.
(Note that I used lower-case variable names to make working with SQL easier.)
In the answer below, I start with a patient visit ((patid, prov, group, date)
combination) and look forward 90 days to capture all visits by that patient (patid
) to other providers (prov != prov_other
). I then count the number of distinct providers in that lookahead period (this will be NA
when there are no visits, as when looking at a patient's last visit in the sample). I then count the number of visits where the number of additional distinct providers in the subsequent 90 days is 3 or more.
Finally, I group by (group, year)
and count the proportion of visits that are followed by visits to at least three other providers during the subsequent 90 days. Given the way the data are generated, it is no surprise that the two groups look similar on this metric.
Note that each patient visit forms a unit of observation here. In practice, it may make sense to aggregate by (say) (patid, year)
before calculating statistics or do some other kind of aggregation.
library(data.table)
library(dplyr, warn.conflicts = FALSE)
set.seed(1)
test <- tibble(
patid = sample(1:1e4, 1e5, replace = TRUE),
prov = sample(1:50, 1e5, replace = TRUE),
group = sample(0:1, 1e5, replace = TRUE),
date = as.Date(sample(
as.Date("2020-01-01"):as.Date("2020-12-31"),
1e5,
replace = TRUE
), origin = "1970-01-01")) %>%
as.data.table()
test
#> patid prov group date
#> 1: 1017 6 1 2020-08-03
#> 2: 8004 34 0 2020-12-15
#> 3: 4775 32 0 2020-06-21
#> 4: 9725 47 1 2020-09-25
#> 5: 8462 15 0 2020-03-05
#> ---
#> 99996: 949 47 0 2020-07-05
#> 99997: 2723 37 0 2020-08-18
#> 99998: 201 27 1 2020-01-06
#> 99999: 163 9 0 2020-03-06
#> 100000: 3204 48 1 2020-11-17
df_overlap <-
test %>%
inner_join(test, by = "patid", suffix = c("", "_other")) %>%
filter(prov != prov_other) %>%
filter(date_other >= date & date_other <= date + 90L)
mt_4_provs_df <-
df_overlap %>%
group_by(patid, prov, group, date) %>%
summarize(n_providers = n_distinct(prov_other), .groups = "drop")
results <-
test %>%
left_join(mt_4_provs_df, by = c("patid", "prov", "group", "date")) %>%
mutate(mt_4_provs = n_providers >= 3,
year = year(date)) %>%
group_by(group, year) %>%
summarize(prop_mt_4_provs = mean(mt_4_provs, na.rm = TRUE),
.groups = "drop")
results
#> # A tibble: 2 x 3
#> group year prop_mt_4_provs
#> <int> <int> <dbl>
#> 1 0 2020 0.426
#> 2 1 2020 0.423
Created on 2021-06-22 by the reprex package (v2.0.0)
Upvotes: 0
Reputation: 25223
Based on I'm looking for the annual "group" proportion of patients that visit >=4 providers within 90 days, you can try this:
library(data.table) #data.table 1.13.2
setDT(test)[, c("d90ago", "d90aft") := .(DATE - 90L, DATE + 90L)]
setkey(test, PATID, DATE)
test[, grp :=
.SD[.SD, on=.(PATID, DATE>=d90ago, DATE<=d90aft), by=.EACHI, +(length(unique(x.PROV))>=4L)]$V1
]
The above allows PROV within overlapping windows of 90 days to be re-used.
Upvotes: 1