PaulaSpinola
PaulaSpinola

Reputation: 531

More efficient match with data.table

I have a massive dataset with information on hospitalizations: it includes id of hospitalization, id of physicians, id of hospital, as well as admission/discharge dates. Given that one hospitalization may involve multiple physicians, each row in the data is identified at the hospitalization id - physician id level. A physician may work in multiple hospitals. There are 92M hospitalizations, 150k physicians and 6k hospitals in my data.

I have another dataset with information on physicians' specialties. A physician may have multiple specialties.

I want to find, for each hospitalization-physician ids, the ids of all other hospitalizations in the same hospital that were concluded in the 30 days prior to the start of that given hospitalization and performed exclusively by other physicians in the same specialty.

Consider the simple example below. The sample variable added to df flags the hospitalization ids which will have at least 1 other hospitalization linked to it according to the criteria explained above.

df <- data.frame(hospitalization_id = c(1, 2, 3,
                                        1, 2, 3,
                                        4, 5, 
                                        6, 7, 8),
                 hospital_id = c("A", "A", "A", 
                                 "A", "A", "A", 
                                 "A", "A",
                                 "B", "B", "B"),
                 physician_id = c(1, 1, 1, 
                                  2, 2, 2,
                                  3, 3, 
                                  2, 2, 2),
                 date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-12", "2000-01-20",
                                        "2000-02-10", "2000-02-11", "2000-02-12")),
                 date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-18", "2000-01-22",
                                      "2000-02-11", "2000-02-14", "2000-02-17")))
df <- df %>%
  mutate(sample = c(0,0,0,0,0,1,1,1,0,0,0))

physician_spec <- data.frame(physician_id = c(1, 2, 2, 3),
                             specialty_code = c(100, 100, 200, 200))

With the help of StackOverFlow fellows (link to original post: Find set of rows in row-specific range with restriction at different levels), I now have the following code that works perfectly fine. The issue is that the code takes forever to run. In the past 3 days it went over only 300 hospitals out of the 6k hospitals in the data.

setDT(df)
setDT(physician_spec)

peers_in_spec <- function(p) {
  physician_spec[
    physician_id != p &
      specialty_code %in% physician_spec[physician_id==p, specialty_code],
    physician_id]
}

f <- function(p, st) { 
  peers_in_spec = peers_in_spec(p)
  exclude_hosps = df_hospital[physician_id == p, unique(hospitalization_id)]
  unique(df_hospital[
    physician_id %in% peers_in_spec(p) &
      (st - date_end)>=1 & (st - date_end)<=30 & 
      !hospitalization_id %in% exclude_hosps
  ]$hospitalization_id) 
}

for(h in unique(df$hospital_id)) {
  
  print(paste0("Hospital id: ", h))
  df_hospital <- df[hospital_id==h] 
  
  tryCatch({
    output <- df_hospital[sample==1,
                .(peer_hospid = f(physician_id, date_start)), 
                .(physician_id, hospitalization_id)]
    print(output)
  }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})

}

I was wondering if there is a way to make the code more efficient: either by further subsetting the original data before applying the f() function (for instance, by looping over each day-hospital and subsetting the data for the period prior to that day before applying f), or adjusting the code in some other way.

Upvotes: 0

Views: 89

Answers (1)

langtang
langtang

Reputation: 24722

Here is one alternative approach:

# load libraries
library(data.table)
library(magrittr)

# set as data.table
setDT(physician_spec)
setDT(df)


# Create a physician match table.. for each physician, which other physicians are matched by speciality?
phys_match = physician_spec[physician_spec, on="specialty_code"] %>% 
  .[physician_id!=i.physician_id, .(mds = list(i.physician_id)),physician_id]

# Merge df on itself, using a join on hospital_id, and non-equi join re the start_date
k <- df[df[,.(hospital_id,h_id = hospitalization_id, date_end,e=date_end+30,other_md = physician_id)],
   on=.(hospital_id, date_start>date_end, date_start<e), nomatch=0] %>% 
  .[, .(hospitalization_id, h_id, hospital_id, physician_id, other_md)] %>% 
  .[phys_match,on="physician_id"]

# add speciality match boolean, and keep if this match is true
k[, spec_match:=other_md %in% mds[[1]], 1:nrow(k)]

# helper function checks: if the physician_id value (which is constant, so use p[1])
# is in o, then we return False, otherwise we check if among the rows where speciality matches
# there is a set difference of length>0

f <- function(p,o,m) {
  fifelse((p[1] %in% o),F,length(setdiff(o[m],p[m]))>0)
}

k[, f(physician_id, other_md,spec_match), .(hospitalization_id, h_id)][V1==TRUE][, V1:=NULL][]

Output:

   hospitalization_id h_id
1:                  3    4
2:                  4    1
3:                  5    1
4:                  5    2

Upvotes: 1

Related Questions