R - Faster way of creating conditional sums as new column in data frame

Question

My initial problem

I have a data frame of hospital visits that looks like this:

df = data.frame(PNUM = c(1,1,1,1,2,2,2,2),
                indate=as.Date(c("2016-01-03","2016-05-05","2017-02-03",
                                 "2017-06-07","2016-01-03","2016-05-05",
                                 "2017-02-03","2017-06-07")),
                Inpatient=c(0,1,0,1,1,1,1,0),
                AnE=c(1,0,1,0,0,0,0,1))

Output:

  PNUM     indate Inpatient AnE
1    1 2016-01-03         0   1
2    1 2016-05-05         1   0
3    1 2017-02-03         0   1
4    1 2017-06-07         1   0
5    2 2016-01-03         1   0
6    2 2016-05-05         1   0
7    2 2017-02-03         1   0
8    2 2017-06-07         0   1

I now want to add columns that reflect the number of "Inpatient" and "AnE" visits in the 365 days prior to the current "indate". The desired result looks like this:

  PNUM     indate Inpatient AnE sum_365_Inpatient sum_365_AnE
1    1 2016-01-03         0   1                 0           0
2    1 2016-05-05         1   0                 0           1
3    1 2017-02-03         0   1                 1           0
4    1 2017-06-07         1   0                 0           1
5    2 2016-01-03         1   0                 0           0
6    2 2016-05-05         1   0                 1           0
7    2 2017-02-03         1   0                 1           0
8    2 2017-06-07         0   1                 1           0

I have found a way to do this (see below) but it is very slow (~ 4 min for 1 new column with 10,000 rows). My original data frame has 2 mio rows and >100 columns for which I want to create these sums. I'm relatively new to R, and created the below solution by putting together things from several similar problems. I guess it's not very efficient. I would be grateful for any suggestions of how to improve my code.

Here is my very inefficient solution

I first define a function that calculates the sum of a specific column looking back X days (addtionally restricted by ID as I only want events from the same person)

# Function definition

hist_sum = function(colname,ID,date_input,x) { 
    # window start and end
    window_start = date_input - x
    window_end = date_input
    # Calculate sum within window
    sum(df[(df$PNUM == ID) & (df$indate >= window_start) &
           (df$indate < window_end),c(colname)])
}


# Vectorise function

hist_sum = Vectorize(hist_sum)

I then use a for loop and dplyr's mutate function to calculate the sums for "Inpatient" and "AnE" columns, using PNUM as ID, indate = as event date, a window of 365 days (and create a unique column name for each):

library(dplyr)

for (i in c("Inpatient","AnE")) {
    # Generate column title
    coltitle = paste("sum",as.character(j),i,sep="_")
    # Apply 
    df = mutate(df, !!coltitle := hist_sum(i,PNUM,indate,365))
}

Frank · Accepted Answer

A non-equi join is designed to do this.

For the case of mutually exclusive dummy columns...

First, some setup...

# go to long form

library(data.table)
DT = melt(setDT(df), id=c("PNUM", "indate"), variable.name = "status")[value == 1, !"value"]
setorder(DT, PNUM, indate)

# use integer dates

DT[, indate := as.IDate(indate)]


   PNUM     indate    status
1:    1 2016-01-03       AnE
2:    1 2016-05-05 Inpatient
3:    1 2017-02-03       AnE
4:    1 2017-06-07 Inpatient
5:    2 2016-01-03 Inpatient
6:    2 2016-05-05 Inpatient
7:    2 2017-02-03 Inpatient
8:    2 2017-06-07       AnE

Count 'em

for (s in unique(DT$status)){
  DT[, paste0("n365_", s) := 
    .SD[status == s][.SD[, .(PNUM, d_dn = indate - 365L, d_up = indate)], 
      on=.(PNUM, indate >= d_dn, indate < d_up),
      .N, by=.EACHI]$N
 ][]
}

   PNUM     indate    status n365_AnE n365_Inpatient
1:    1 2016-01-03       AnE        0              0
2:    1 2016-05-05 Inpatient        1              0
3:    1 2017-02-03       AnE        0              1
4:    1 2017-06-07 Inpatient        1              0
5:    2 2016-01-03 Inpatient        0              0
6:    2 2016-05-05 Inpatient        0              1
7:    2 2017-02-03 Inpatient        0              1
8:    2 2017-06-07       AnE        0              1

How it works. Written more verbosely:

for (s in unique(DT$status)){
  DT[, paste0("n365_", s) := {

    # define the ranges we are interested in
    look_these_up = .SD[, .(PNUM, d_dn = indate - 365L, d_up = indate)]

    # define where we are looking
    look_in_here = .SD[status == s]

    # do the lookup
    # counting rows of look_in_here (.N)
    look_in_here[look_these_up, on=.(PNUM, indate >= d_dn, indate < d_up),
      .N, by=.EACHI]$N
 }][]
}

The syntax for a data.table join is x[i, on=, j] where we use the on= rules to look up each row of i in x and then do j. See ?data.table for details.

For the case of possibly overlapping dummy columns...

This possibility was brought up by the OP in a comment. In this case, we can't go to long form and collapse to a single "status" column.

library(data.table)
DT = data.table(df)
mycols = setdiff(names(DT), c("PNUM", "indate"))

# use integer dates
DT[, indate := as.IDate(indate)]

# use integer dummies
DT[, (mycols) := lapply(.SD, as.integer), .SDcols=mycols]

DT[, paste0("n365_", mycols) := {
  # define the ranges we are interested in
  look_these_up = DT[, .(PNUM, d_dn = indate - 365L, d_up = indate)]

  lapply(mycols, function(s){
    # define where we are looking
    look_in_here = .SD[get(s) == 1L]

    # do the lookup, counting rows of look_in_here (.N)
    look_in_here[look_these_up, on=.(PNUM, indate >= d_dn, indate < d_up),
    .N, by=.EACHI]$N
  })
 }][]

Generally, joins/lookups are faster against integers than floating point numbers, which is why that conversion is done here. The lapply and for loop ways are equivalent, though the lapply way only involves constructing look_these_up once, and so may be faster.

R - Faster way of creating conditional sums as new column in data frame

Answers (1)

Related Questions