More efficient match with data.table

Question

I have a massive dataset with information on hospitalizations: it includes id of hospitalization, id of physicians, id of hospital, as well as admission/discharge dates. Given that one hospitalization may involve multiple physicians, each row in the data is identified at the hospitalization id - physician id level. A physician may work in multiple hospitals. There are 92M hospitalizations, 150k physicians and 6k hospitals in my data.

I have another dataset with information on physicians' specialties. A physician may have multiple specialties.

I want to find, for each hospitalization-physician ids, the ids of all other hospitalizations in the same hospital that were concluded in the 30 days prior to the start of that given hospitalization and performed exclusively by other physicians in the same specialty.

Consider the simple example below. The sample variable added to df flags the hospitalization ids which will have at least 1 other hospitalization linked to it according to the criteria explained above.

df <- data.frame(hospitalization_id = c(1, 2, 3,
                                        1, 2, 3,
                                        4, 5, 
                                        6, 7, 8),
                 hospital_id = c("A", "A", "A", 
                                 "A", "A", "A", 
                                 "A", "A",
                                 "B", "B", "B"),
                 physician_id = c(1, 1, 1, 
                                  2, 2, 2,
                                  3, 3, 
                                  2, 2, 2),
                 date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-12", "2000-01-20",
                                        "2000-02-10", "2000-02-11", "2000-02-12")),
                 date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-18", "2000-01-22",
                                      "2000-02-11", "2000-02-14", "2000-02-17")))
df <- df %>%
  mutate(sample = c(0,0,0,0,0,1,1,1,0,0,0))

physician_spec <- data.frame(physician_id = c(1, 2, 2, 3),
                             specialty_code = c(100, 100, 200, 200))

With the help of StackOverFlow fellows (link to original post: Find set of rows in row-specific range with restriction at different levels), I now have the following code that works perfectly fine. The issue is that the code takes forever to run. In the past 3 days it went over only 300 hospitals out of the 6k hospitals in the data.

setDT(df)
setDT(physician_spec)

peers_in_spec <- function(p) {
  physician_spec[
    physician_id != p &
      specialty_code %in% physician_spec[physician_id==p, specialty_code],
    physician_id]
}

f <- function(p, st) { 
  peers_in_spec = peers_in_spec(p)
  exclude_hosps = df_hospital[physician_id == p, unique(hospitalization_id)]
  unique(df_hospital[
    physician_id %in% peers_in_spec(p) &
      (st - date_end)>=1 & (st - date_end)<=30 & 
      !hospitalization_id %in% exclude_hosps
  ]$hospitalization_id) 
}

for(h in unique(df$hospital_id)) {
  
  print(paste0("Hospital id: ", h))
  df_hospital <- df[hospital_id==h] 
  
  tryCatch({
    output <- df_hospital[sample==1,
                .(peer_hospid = f(physician_id, date_start)), 
                .(physician_id, hospitalization_id)]
    print(output)
  }, error=function(e){cat("ERROR :",conditionMessage(e), "
")})

}

I was wondering if there is a way to make the code more efficient: either by further subsetting the original data before applying the f() function (for instance, by looping over each day-hospital and subsetting the data for the period prior to that day before applying f), or adjusting the code in some other way.

More efficient match with data.table

Answers (1)

Related Questions