R: Compare data.table and pass variable while respecting key

Question

I have two data.tables:

original <- data.frame(id = c(rep("RE01",5),rep("RE02",5)),date.time = head(seq.POSIXt(as.POSIXct("2015-11-01 01:00:00"),as.POSIXct("2015-11-05 01:00:00"),60*60*10),10))
compare <- data.frame(id = c("RE01","RE02"),seq = c(1,2),start = as.POSIXct(c("2015-11-01 20:00:00","2015-11-04 08:00:00")),end = as.POSIXct(c("2015-11-02 08:00:00","2015-11-04 20:00:00")))
setDT(original)
setDT(compare)

I would like to check the date in each row of original and see if it lies between the start and finish dates of compare whilst respecting the id. If it does lie between the two elements, a variable should be passed to original (compare$diff.seq). The output should look like this:

    original
     id           date.time diff.seq
1  RE01 2015-11-01 01:00:00       NA
2  RE01 2015-11-01 11:00:00       NA
3  RE01 2015-11-01 21:00:00        1
4  RE01 2015-11-02 07:00:00        1
5  RE01 2015-11-02 17:00:00       NA
6  RE02 2015-11-03 03:00:00       NA
7  RE02 2015-11-03 13:00:00       NA
8  RE02 2015-11-03 23:00:00       NA
9  RE02 2015-11-04 09:00:00        2
10 RE02 2015-11-04 19:00:00        2

I've been reading the manual and SO for hours and trying "on", "by" and so on.. without any success. Can anybody point me in the right direction?

David Arenburg · Accepted Answer

As said in the comments, this is very straight forward using data.table::foverlaps

You basically have to create an additional column in the original data set in order to set join boundaries, then key the two data sets by the columns you want to join on and then simply run forverlas and select the desired columns

original[, end := date.time]
setkey(original, id, date.time, end)
setkey(compare, id, start, end)
foverlaps(original, compare)[, .(id, date.time, seq)]
#       id           date.time seq
#  1: RE01 2015-11-01 01:00:00  NA
#  2: RE01 2015-11-01 11:00:00  NA
#  3: RE01 2015-11-01 21:00:00   1
#  4: RE01 2015-11-02 07:00:00   1
#  5: RE01 2015-11-02 17:00:00  NA
#  6: RE02 2015-11-03 03:00:00  NA
#  7: RE02 2015-11-03 13:00:00  NA
#  8: RE02 2015-11-03 23:00:00  NA
#  9: RE02 2015-11-04 09:00:00   2
# 10: RE02 2015-11-04 19:00:00   2

Alternatively, you can run foverlaps the other way around and then just update the original data set by reference while selecting the correct rows to update

indx <- foverlaps(compare, original, which = TRUE)
original[indx$yid, diff.seq := indx$xid]
original
#       id           date.time                 end diff.seq
#  1: RE01 2015-11-01 01:00:00 2015-11-01 01:00:00       NA
#  2: RE01 2015-11-01 11:00:00 2015-11-01 11:00:00       NA
#  3: RE01 2015-11-01 21:00:00 2015-11-01 21:00:00        1
#  4: RE01 2015-11-02 07:00:00 2015-11-02 07:00:00        1
#  5: RE01 2015-11-02 17:00:00 2015-11-02 17:00:00       NA
#  6: RE02 2015-11-03 03:00:00 2015-11-03 03:00:00       NA
#  7: RE02 2015-11-03 13:00:00 2015-11-03 13:00:00       NA
#  8: RE02 2015-11-03 23:00:00 2015-11-03 23:00:00       NA
#  9: RE02 2015-11-04 09:00:00 2015-11-04 09:00:00        2
# 10: RE02 2015-11-04 19:00:00 2015-11-04 19:00:00        2

R: Compare data.table and pass variable while respecting key

Answers (1)

Related Questions