Reputation: 35
My initial problem
I have a data frame of hospital visits that looks like this:
df = data.frame(PNUM = c(1,1,1,1,2,2,2,2),
indate=as.Date(c("2016-01-03","2016-05-05","2017-02-03",
"2017-06-07","2016-01-03","2016-05-05",
"2017-02-03","2017-06-07")),
Inpatient=c(0,1,0,1,1,1,1,0),
AnE=c(1,0,1,0,0,0,0,1))
Output:
PNUM indate Inpatient AnE
1 1 2016-01-03 0 1
2 1 2016-05-05 1 0
3 1 2017-02-03 0 1
4 1 2017-06-07 1 0
5 2 2016-01-03 1 0
6 2 2016-05-05 1 0
7 2 2017-02-03 1 0
8 2 2017-06-07 0 1
I now want to add columns that reflect the number of "Inpatient" and "AnE" visits in the 365 days prior to the current "indate". The desired result looks like this:
PNUM indate Inpatient AnE sum_365_Inpatient sum_365_AnE
1 1 2016-01-03 0 1 0 0
2 1 2016-05-05 1 0 0 1
3 1 2017-02-03 0 1 1 0
4 1 2017-06-07 1 0 0 1
5 2 2016-01-03 1 0 0 0
6 2 2016-05-05 1 0 1 0
7 2 2017-02-03 1 0 1 0
8 2 2017-06-07 0 1 1 0
I have found a way to do this (see below) but it is very slow (~ 4 min for 1 new column with 10,000 rows). My original data frame has 2 mio rows and >100 columns for which I want to create these sums. I'm relatively new to R, and created the below solution by putting together things from several similar problems. I guess it's not very efficient. I would be grateful for any suggestions of how to improve my code.
Here is my very inefficient solution
I first define a function that calculates the sum of a specific column looking back X days (addtionally restricted by ID as I only want events from the same person)
# Function definition
hist_sum = function(colname,ID,date_input,x) {
# window start and end
window_start = date_input - x
window_end = date_input
# Calculate sum within window
sum(df[(df$PNUM == ID) & (df$indate >= window_start) &
(df$indate < window_end),c(colname)])
}
# Vectorise function
hist_sum = Vectorize(hist_sum)
I then use a for loop and dplyr's mutate function to calculate the sums for "Inpatient" and "AnE" columns, using PNUM as ID, indate = as event date, a window of 365 days (and create a unique column name for each):
library(dplyr)
for (i in c("Inpatient","AnE")) {
# Generate column title
coltitle = paste("sum",as.character(j),i,sep="_")
# Apply
df = mutate(df, !!coltitle := hist_sum(i,PNUM,indate,365))
}
Upvotes: 1
Views: 232
Reputation: 66819
A non-equi join is designed to do this.
For the case of mutually exclusive dummy columns...
First, some setup...
# go to long form
library(data.table)
DT = melt(setDT(df), id=c("PNUM", "indate"), variable.name = "status")[value == 1, !"value"]
setorder(DT, PNUM, indate)
# use integer dates
DT[, indate := as.IDate(indate)]
PNUM indate status
1: 1 2016-01-03 AnE
2: 1 2016-05-05 Inpatient
3: 1 2017-02-03 AnE
4: 1 2017-06-07 Inpatient
5: 2 2016-01-03 Inpatient
6: 2 2016-05-05 Inpatient
7: 2 2017-02-03 Inpatient
8: 2 2017-06-07 AnE
Count 'em
for (s in unique(DT$status)){
DT[, paste0("n365_", s) :=
.SD[status == s][.SD[, .(PNUM, d_dn = indate - 365L, d_up = indate)],
on=.(PNUM, indate >= d_dn, indate < d_up),
.N, by=.EACHI]$N
][]
}
PNUM indate status n365_AnE n365_Inpatient
1: 1 2016-01-03 AnE 0 0
2: 1 2016-05-05 Inpatient 1 0
3: 1 2017-02-03 AnE 0 1
4: 1 2017-06-07 Inpatient 1 0
5: 2 2016-01-03 Inpatient 0 0
6: 2 2016-05-05 Inpatient 0 1
7: 2 2017-02-03 Inpatient 0 1
8: 2 2017-06-07 AnE 0 1
How it works. Written more verbosely:
for (s in unique(DT$status)){
DT[, paste0("n365_", s) := {
# define the ranges we are interested in
look_these_up = .SD[, .(PNUM, d_dn = indate - 365L, d_up = indate)]
# define where we are looking
look_in_here = .SD[status == s]
# do the lookup
# counting rows of look_in_here (.N)
look_in_here[look_these_up, on=.(PNUM, indate >= d_dn, indate < d_up),
.N, by=.EACHI]$N
}][]
}
The syntax for a data.table join is x[i, on=, j]
where we use the on=
rules to look up each row of i
in x
and then do j
. See ?data.table
for details.
For the case of possibly overlapping dummy columns...
This possibility was brought up by the OP in a comment. In this case, we can't go to long form and collapse to a single "status" column.
library(data.table)
DT = data.table(df)
mycols = setdiff(names(DT), c("PNUM", "indate"))
# use integer dates
DT[, indate := as.IDate(indate)]
# use integer dummies
DT[, (mycols) := lapply(.SD, as.integer), .SDcols=mycols]
DT[, paste0("n365_", mycols) := {
# define the ranges we are interested in
look_these_up = DT[, .(PNUM, d_dn = indate - 365L, d_up = indate)]
lapply(mycols, function(s){
# define where we are looking
look_in_here = .SD[get(s) == 1L]
# do the lookup, counting rows of look_in_here (.N)
look_in_here[look_these_up, on=.(PNUM, indate >= d_dn, indate < d_up),
.N, by=.EACHI]$N
})
}][]
Generally, joins/lookups are faster against integers than floating point numbers, which is why that conversion is done here. The lapply
and for
loop ways are equivalent, though the lapply
way only involves constructing look_these_up
once, and so may be faster.
Upvotes: 4