within groups, finding the correct interval a date resides in across multiple groups

Question

In one data set I have account numbers (all unique) and the date that some event happened in the account.

In another data set I have the account numbers and an account status indicator along with the date the account entered and exitied that status level. The accounts are listed multiple times as the status dates produce a history of when the account moved from different statuses.

I wish to append the status that the account was in on the date the event happend to the first data set.

I have built a loop that performs this task but given the number of accounts I am working with, the length of the history, and the number of status "switches", the loop takes a few hours to run on my system.

Since the account numbers are the same between the two files I was wondering if there was a way to use the setkey functionality on account number using the data.table package and to use a data.table approach to speed up the append process. Basically within each account I need to see which interval the date from the first data set is in in the second data set to get the status.

Here is the code I've produced for my loop along with some toy data. I have tried using intervals in the lubridate package but this was giving me some issues with the data.table in the loop so I've gone with the between() command.

Does anyone have any ideas for a more efficient append process.

library(data.table)
library(lubridate)

set.seed(65)
# data set 1
dt1 <- data.table(account=c(1234,1235,1236,1237,1238),
            eventDate=c(ymd(20170123),ymd(20170223),ymd(20170114),ymd(20170205),ymd(20170127)))
setkey(dt1,account)

# data set 2
se1 <- seq(from=ymd(20161201),to=ymd(20170228), length.out=4)
se2 <- seq(from=ymd(20170101),to=ymd(20170228), length.out=5)
se3 <- seq(from=ymd(20170103),to=ymd(20170228), length.out=4)
se4 <- seq(from=ymd(20160101),to=ymd(20170228), length.out=3)
se5 <- seq(from=ymd(20161101),to=ymd(20170228), length.out=6)
ss1 <- c(se1[1]-days(23),se1[-length(se1)]+days(1))
ss2 <- c(se2[1]-days(13),se2[-length(se2)]+days(1))
ss3 <- c(se3[1]-days(3),se3[-length(se3)]+days(1))
ss4 <- c(se4[1]-days(53),se4[-length(se4)]+days(1))
ss5 <- c(se5[1]-days(2),se5[-length(se5)]+days(1))

dt2 <- data.table(account=c(rep(1234,4),rep(1235,5),rep(1236,4),rep(1237,3),rep(1238,6)),
            status=sample(LETTERS,22, replace=T),
            statusStart=c(ss1,ss2,ss3,ss4,ss5),
            statusEnd=c(se1,se2,se3,se4,se5))

setkey(dt2,account)

#dt2[,interv:=interval(statusStart,statusEnd)]

# set up and do the loop    
accnts <- dt1[,unique(account)]

for(i in 1:length(accnts)){
    dt2[    account==accnts[i] & 
        between(dt1[account==accnts[i],eventDate],statusStart,statusEnd,incbounds=T),
        eventDate:=dt1[account==accnts[i],eventDate]]   
}

# put it back in the first data set
dt1 <- merge(dt1,dt2[!is.na(eventDate),list(account,eventDate,status)], by=c('account','eventDate'),all.x=T)

oropendola · Accepted Answer

Here's one way to do it, using the foverlaps function from data.table:

dt1$statusStart <- dt1$eventDate
dt1$statusEnd <- dt1$eventDate
setkey(dt2, account, statusStart, statusEnd)
foverlaps(dt1, dt2, by.x = c('account', 'statusStart', 'statusEnd'), by.y = c('account', 'statusStart', 'statusEnd'), type = 'within')

within groups, finding the correct interval a date resides in across multiple groups

Answers (1)

Related Questions