Reputation: 973
Suppose I have two tables. One with appointments and second with receptions. Each table has filial ID, medic ID, start and end time (plan for appointments and fact for receptions) and some other data. I want to count how much of appointments have receptions inside time interval of appointment period. Reception fact can begin before appointment start time, after, it can be inside app. interval, etc.
Below I made two tables. One for appointments and one for receptions. I wrote nested loop but it works very slow. My tables contains approximately 50 mio rows each. I need fast solution for this problem. How can I do this without loop? Thanks in advance!
library(data.table)
date <- as.POSIXct('2015-01-01 14:30:00')
# appointments data table
app <- data.table(med.id = 1:10,
filial.id = rep(c(100,200), each = 5),
start.time = rep(seq(date, length.out = 5, by = "hours"),2),
end.time = rep(seq(date+3599, length.out = 5, by = "hours"),2),
A = rnorm(10))
# receptions data table
re <- data.table(med.id = c(1,11,3,4,15,6,7),
filial.id = c(rep(100, 5), 200,200),
start.time = as.POSIXct(paste(rep('2015-01-01 ',7), c('14:25:00', '14:25:00','16:32:00', '17:25:00', '16:10:00', '15:35:00','15:50:00'))),
end.time = as.POSIXct(paste(rep('2015-01-01 ',7), c('15:25:00', '15:20:00','17:36:00', '18:40:00', '16:10:00', '15:49:00','16:12:00'))),
B = rnorm(7))
app$count <- 0
for (i in 1:dim(app)[1]){
for (j in 1:dim(re)[1]){
if ((app$med.id[i] == re$med.id[j]) & # med.id is equal and
app$filial.id[i] == re$filial.id[j]) { # filial.id is equal
if ((re$start.time[j] < app$start.time[i]) & (re$end.time[j] > app$start.time[i])) { # reception starts before appointment start time and ends after appointment start time OR
app$count[i] <- app$count[i] + 1
} else if ((re$start.time[j] < app$end.time[i]) & (re$start.time[j] > app$start.time[i])) { # reception starts before appointment end time and after app. start time
app$count[i] <- app$count[i] + 1
}
}
}
}
Upvotes: 0
Views: 262
Reputation: 118779
Using foverlaps()
:
setkey(re, med.id, filial.id, start.time, end.time)
olaps = foverlaps(app, re, which=TRUE, nomatch=0L)[, .N, by=xid]
app[, count := 0L][olaps$xid, count := olaps$N]
app
# med.id filial.id start.time end.time A count
# 1: 1 100 2015-01-01 14:30:00 2015-01-01 15:29:59 0.60878560 1
# 2: 2 100 2015-01-01 15:30:00 2015-01-01 16:29:59 -0.11545284 0
# 3: 3 100 2015-01-01 16:30:00 2015-01-01 17:29:59 0.68992084 1
# 4: 4 100 2015-01-01 17:30:00 2015-01-01 18:29:59 0.04703938 1
# 5: 5 100 2015-01-01 18:30:00 2015-01-01 19:29:59 -0.95315419 0
# 6: 6 200 2015-01-01 14:30:00 2015-01-01 15:29:59 0.26193554 0
# 7: 7 200 2015-01-01 15:30:00 2015-01-01 16:29:59 1.55206077 1
# 8: 8 200 2015-01-01 16:30:00 2015-01-01 17:29:59 0.44517362 0
# 9: 9 200 2015-01-01 17:30:00 2015-01-01 18:29:59 0.11475881 0
# 10: 10 200 2015-01-01 18:30:00 2015-01-01 19:29:59 -0.66139828 0
PS: please go through the vignettes and learn to use data tables effectively.
Upvotes: 2
Reputation: 1149
I actually don't think you need to merge by time overlap at all: your code is actually merging by med.id
and filial.id
then performing a simple comparison.
First, for clarity, let's rename the start.time
and end.time
fields:
setnames(app, c("start.time", "end.time"), c("app.start.time", "app.end.time"))
setnames(re, c("start.time", "end.time"), c("re.start.time", "re.end.time"))
You should then merge the two data.tables on the keys med.id
and filial.id
, like this:
app_re <- re[app, on=c("med.id", "filial.id")]
# med.id filial.id re.start.time re.end.time B
# 1: 1 100 2015-01-01 14:25:00 2015-01-01 15:25:00 0.4307760
# 2: 2 100 <NA> <NA> NA
# 3: 3 100 2015-01-01 16:32:00 2015-01-01 17:36:00 -1.2933755
# 4: 4 100 2015-01-01 17:25:00 2015-01-01 18:40:00 -1.2374469
# 5: 5 100 <NA> <NA> NA
# 6: 6 200 2015-01-01 15:35:00 2015-01-01 15:49:00 -0.8054822
# 7: 7 200 2015-01-01 15:50:00 2015-01-01 16:12:00 2.5742241
# 8: 8 200 <NA> <NA> NA
# 9: 9 200 <NA> <NA> NA
# 10: 10 200 <NA> <NA> NA
# app.start.time app.end.time A
# 1: 2015-01-01 14:30:00 2015-01-01 15:29:59 -0.26828337
# 2: 2015-01-01 15:30:00 2015-01-01 16:29:59 0.24246341
# 3: 2015-01-01 16:30:00 2015-01-01 17:29:59 1.55824948
# 4: 2015-01-01 17:30:00 2015-01-01 18:29:59 1.25829302
# 5: 2015-01-01 18:30:00 2015-01-01 19:29:59 1.14244558
# 6: 2015-01-01 14:30:00 2015-01-01 15:29:59 -0.41234563
# 7: 2015-01-01 15:30:00 2015-01-01 16:29:59 0.07710022
# 8: 2015-01-01 16:30:00 2015-01-01 17:29:59 -1.46421985
# 9: 2015-01-01 17:30:00 2015-01-01 18:29:59 1.21682394
# 10: 2015-01-01 18:30:00 2015-01-01 19:29:59 1.11197318
You can then create your count variable with the same conditions as before:
app_re[, count :=
as.numeric(re.start.time < app.start.time & re.end.time > app.start.time) |
(re.start.time < app.end.time & re.start.time > app.start.time)]
# Convert the NAs to 0
app_re[, count := ifelse(is.na(count), 0, count)]
This should be much faster than the for
loops.
Upvotes: 1