BiXiC
BiXiC

Reputation: 973

merge data tables by time intervals overlap

Suppose I have two tables. One with appointments and second with receptions. Each table has filial ID, medic ID, start and end time (plan for appointments and fact for receptions) and some other data. I want to count how much of appointments have receptions inside time interval of appointment period. Reception fact can begin before appointment start time, after, it can be inside app. interval, etc.

Below I made two tables. One for appointments and one for receptions. I wrote nested loop but it works very slow. My tables contains approximately 50 mio rows each. I need fast solution for this problem. How can I do this without loop? Thanks in advance!

library(data.table)

date <- as.POSIXct('2015-01-01 14:30:00')

# appointments data table
app <- data.table(med.id = 1:10,
                  filial.id = rep(c(100,200), each = 5),
                  start.time = rep(seq(date, length.out = 5, by = "hours"),2),
                  end.time = rep(seq(date+3599, length.out = 5, by = "hours"),2),
                  A = rnorm(10))


# receptions data table
re <- data.table(med.id = c(1,11,3,4,15,6,7),
                 filial.id = c(rep(100, 5), 200,200),
                 start.time = as.POSIXct(paste(rep('2015-01-01 ',7), c('14:25:00', '14:25:00','16:32:00', '17:25:00', '16:10:00', '15:35:00','15:50:00'))),
                 end.time = as.POSIXct(paste(rep('2015-01-01 ',7), c('15:25:00', '15:20:00','17:36:00', '18:40:00', '16:10:00', '15:49:00','16:12:00'))),
                 B = rnorm(7))



app$count <- 0

for (i in 1:dim(app)[1]){
  for (j in 1:dim(re)[1]){
    if ((app$med.id[i] == re$med.id[j]) & # med.id is equal and
        app$filial.id[i] == re$filial.id[j]) { # filial.id is equal
      if ((re$start.time[j] < app$start.time[i]) & (re$end.time[j] > app$start.time[i])) { # reception starts before appointment start time and ends after appointment start time OR 
        app$count[i] <- app$count[i] + 1
      } else if ((re$start.time[j] < app$end.time[i]) & (re$start.time[j] > app$start.time[i])) { # reception starts before appointment end time and after app. start time
        app$count[i] <- app$count[i] + 1
      }
    }
  }
}

Upvotes: 0

Views: 262

Answers (2)

Arun
Arun

Reputation: 118779

Using foverlaps():

setkey(re, med.id, filial.id, start.time, end.time)
olaps = foverlaps(app, re, which=TRUE, nomatch=0L)[, .N, by=xid]
app[, count := 0L][olaps$xid, count := olaps$N]
app
#     med.id filial.id          start.time            end.time           A count
#  1:      1       100 2015-01-01 14:30:00 2015-01-01 15:29:59  0.60878560     1
#  2:      2       100 2015-01-01 15:30:00 2015-01-01 16:29:59 -0.11545284     0
#  3:      3       100 2015-01-01 16:30:00 2015-01-01 17:29:59  0.68992084     1
#  4:      4       100 2015-01-01 17:30:00 2015-01-01 18:29:59  0.04703938     1
#  5:      5       100 2015-01-01 18:30:00 2015-01-01 19:29:59 -0.95315419     0
#  6:      6       200 2015-01-01 14:30:00 2015-01-01 15:29:59  0.26193554     0
#  7:      7       200 2015-01-01 15:30:00 2015-01-01 16:29:59  1.55206077     1
#  8:      8       200 2015-01-01 16:30:00 2015-01-01 17:29:59  0.44517362     0
#  9:      9       200 2015-01-01 17:30:00 2015-01-01 18:29:59  0.11475881     0
# 10:     10       200 2015-01-01 18:30:00 2015-01-01 19:29:59 -0.66139828     0

PS: please go through the vignettes and learn to use data tables effectively.

Upvotes: 2

cocquemas
cocquemas

Reputation: 1149

I actually don't think you need to merge by time overlap at all: your code is actually merging by med.id and filial.id then performing a simple comparison.

First, for clarity, let's rename the start.time and end.time fields:

setnames(app, c("start.time", "end.time"), c("app.start.time", "app.end.time"))
setnames(re, c("start.time", "end.time"), c("re.start.time", "re.end.time"))

You should then merge the two data.tables on the keys med.id and filial.id, like this:

app_re <- re[app, on=c("med.id", "filial.id")]
#    med.id filial.id       re.start.time         re.end.time          B
# 1:      1       100 2015-01-01 14:25:00 2015-01-01 15:25:00  0.4307760
# 2:      2       100                <NA>                <NA>         NA
# 3:      3       100 2015-01-01 16:32:00 2015-01-01 17:36:00 -1.2933755
# 4:      4       100 2015-01-01 17:25:00 2015-01-01 18:40:00 -1.2374469
# 5:      5       100                <NA>                <NA>         NA
# 6:      6       200 2015-01-01 15:35:00 2015-01-01 15:49:00 -0.8054822
# 7:      7       200 2015-01-01 15:50:00 2015-01-01 16:12:00  2.5742241
# 8:      8       200                <NA>                <NA>         NA
# 9:      9       200                <NA>                <NA>         NA
# 10:    10       200                <NA>                <NA>         NA
#          app.start.time        app.end.time           A
# 1:  2015-01-01 14:30:00 2015-01-01 15:29:59 -0.26828337
# 2:  2015-01-01 15:30:00 2015-01-01 16:29:59  0.24246341
# 3:  2015-01-01 16:30:00 2015-01-01 17:29:59  1.55824948
# 4:  2015-01-01 17:30:00 2015-01-01 18:29:59  1.25829302
# 5:  2015-01-01 18:30:00 2015-01-01 19:29:59  1.14244558
# 6:  2015-01-01 14:30:00 2015-01-01 15:29:59 -0.41234563
# 7:  2015-01-01 15:30:00 2015-01-01 16:29:59  0.07710022
# 8:  2015-01-01 16:30:00 2015-01-01 17:29:59 -1.46421985
# 9:  2015-01-01 17:30:00 2015-01-01 18:29:59  1.21682394
# 10: 2015-01-01 18:30:00 2015-01-01 19:29:59  1.11197318

You can then create your count variable with the same conditions as before:

app_re[, count := 
  as.numeric(re.start.time < app.start.time & re.end.time > app.start.time) | 
    (re.start.time < app.end.time & re.start.time > app.start.time)]
# Convert the NAs to 0
app_re[, count := ifelse(is.na(count), 0, count)]

This should be much faster than the for loops.

Upvotes: 1

Related Questions