ano
ano

Reputation: 704

Applying rules in sequential order to implement hierarchy

I'm trying to apply a series of rules and change values if the conditions are met. Ideally, I'd like to avoid loops, and have a large data set (12.5 million observations), so efficiency would be nice (but this only needs be done once, so efficiency not critical).

The sample data looks like this. Each id can be present on multiple dates, on each of those dates can have multiple cc

set.seed(1)
id <- 1:200
date <- sample(seq(as.Date('2007/01/01'), as.Date('2012/12/31'), by="day"), 1000, replace=T)
cc <- sample(1:150, 1000, replace=T)
df <- data.frame(id, date, cc)
head(df)

id      date  cc
1 2011-04-28  69
2 2007-05-31 107
3 2012-08-02 100
4 2011-07-04  98
5 2010-08-20 147
6 2009-02-28  51

The hierarchy rules have been set up like this:

year <- rep(2007:2010, each=8)
ifcc <- c(5,7,8,9,15,16,17,18)
r1 <- c(112,NA,NA,NA,NA,NA)
r2 <- c(8,9,10,NA,NA,NA)
r3 <- c(9,10,NA,NA,NA,NA)
r4 <- c(10,NA,NA,NA,NA,NA)
r5 <- c(16,17,18,19,NA,NA)
r6 <- c(17,18,19,NA,NA,NA)
r7 <- c(18,19,NA,NA,NA,NA)
r8 <- c(19,NA,NA,NA,NA,NA)

h <- rbind(r1, r2, r3, r4, r5, r6, r7, r8)
h <- cbind(ifcc, h)
h <- data.frame(year, h)

head(h)
year ifcc  V2 V3 V4 V5 V6 V7
2007    5 112 NA NA NA NA NA
2007    7   8  9 10 NA NA NA
2007    8   9 10 NA NA NA NA
2007    9  10 NA NA NA NA NA
2007   15  16 17 18 19 NA NA
2007   16  17 18 19 NA NA NA

For every id/date combination, I need to check the hierarchy table for rules (for that year, as they change each year). If the condition category cc in df matches the hierarchy rule ifcc in h, then if any of the values in h$V2-h$V7 are present for that id/date combination in df, the rows that match h$V2 to h$V7 need to be dropped from df.

I'm having a hard time wrapping my head around the lookups and application of the hierarchies. there is SAS code which is not only for one year at a time and a series of manually types out if/then statements. Someone has given it a go (http://healthydatascience.com/cms_hcc.html) but not quite for this scale/multiple years and so many repeated measures...

I'm able to iterate through with loops and create a logical matrix of matches, but takes a while for only 30k lookups, not feasible to scale to 12.5m.

t <- matrix(nrow=nrow(df), ncol=nrow(h))
for (j in 1:nrow(df)) {
  for (i in 1:nrow(h)){
  t[j,i] <- df[j,"cc"] == h[i,"ifcc"]
  }
}

I also can't figure out how to use this matrix as a basis for applying the hierarchy rules.

This question directly relates to discussions here: https://github.com/jackwasey/icd/issues/31 https://github.com/anobel/icdtohcc/issues/1

Update

I was able to come up with a functioning solution using a loop. h$cc <- h$ifcc

# Merge hierarchy rules with patient data
df <- merge(df, h, all.x = TRUE)

###########
# create empty list
todrop <- list()

# create a list of dataframes that contain the CCs that will be zero'd out
for (i in 1:6) {
  todrop[[i]] <- df[!is.na(df$ifcc),c(1,2,3+i)]
}

# rename all dfs in list to same column names, rbind into one df
todrop <- lapply(1:length(todrop), function(x) {
  names(todrop[[x]]) <- c("id", "admtdate", "cc")
  return(todrop[[x]])
  }
)
todrop <- do.call(rbind, todrop)

# set flag, all of these CCs will be dropped
todrop$todrop <- T

# merge drop flags with pt data
df <- merge(df, todrop, all.x=T)
df <- df[is.na(pt$todrop), ]

Upvotes: 2

Views: 358

Answers (1)

Kunal Puri
Kunal Puri

Reputation: 3427

An Alternative Solution using data.table:

As you asked for in the question, the code is very much efficient.

.I is the inbuilt functionality provided by data.table which represents the row number of the data table.

library(data.table)

## convert data.frame to data.table
setDT(df)

setDT(h)

## find year from date
df[,year := year(date)]

## merge the two datasets with all values of x present

## if order of tuples doesn't matter, please eliminate the sort=F argument
df2 <- merge(df,h[,.(year,ifcc,.I),],by.x = c('year','cc'),by.y = c('year','ifcc'),all.x = T,sort=F)

## obtaining df having NA values
df <- df2[is.na(I),.(id,date,cc)]

## converting back to data.frame (do it only if required)
setDF(df)

Upvotes: 1

Related Questions