Reputation: 27732
sample data
I've got a data.table with events (dt) and a data.table with all minutes over a certain period (dt.minutes).
library(data.table)

dt <- data.table( id = 1:3,
start = c("2019-01-01 18:00:00", "2019-01-01 19:00:00", "2019-01-01 20:00:00"),
end = c("2019-01-01 21:00:00", "2019-01-01 20:15:00", "2019-01-01 20:30:00") )
dt[, c("start", "end") := lapply( .SD,
as.POSIXct,
format = "%Y-%m-%d %H:%M:%S",
tz = "Europe/Amsterdam"),
.SDcols = c("start", "end")]
dt.minutes <- data.table( from = seq( from = as.POSIXct( "2019-01-01 00:00:00",
format = "%Y-%m-%d %H:%M:%S",
tz = "Europe/Amsterdam"),
to = as.POSIXct( "2019-01-05 00:00:00",
format = "%Y-%m-%d %H:%M:%S",
tz = "Europe/Amsterdam"),
by = "1 min") )
dt.minutes[, to := from + 59 ][]
setkey( dt, start, end)
setkey( dt.minutes, from, to )
The data now looks like this:
> dt
id start end
1: 1 2019-01-01 18:00:00 2019-01-01 21:00:00
2: 2 2019-01-01 19:00:00 2019-01-01 20:15:00
3: 3 2019-01-01 20:00:00 2019-01-01 20:30:00
> dt.minutes
from to
1: 2019-01-01 00:00:00 2019-01-01 00:00:59
2: 2019-01-01 00:01:00 2019-01-01 00:01:59
3: 2019-01-01 00:02:00 2019-01-01 00:02:59
4: 2019-01-01 00:03:00 2019-01-01 00:03:59
5: 2019-01-01 00:04:00 2019-01-01 00:04:59
---
5757: 2019-01-04 23:56:00 2019-01-04 23:56:59
5758: 2019-01-04 23:57:00 2019-01-04 23:57:59
5759: 2019-01-04 23:58:00 2019-01-04 23:58:59
5760: 2019-01-04 23:59:00 2019-01-04 23:59:59
5761: 2019-01-05 00:00:00 2019-01-05 00:00:59
problem
For each row (= minute) in dt.minutes, I want to know how many events from dt were taking place anywhere during that minute.
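For concreteness, a quick manual check of a single minute against the sample data (the chosen minute and the variable names m.from / m.to are just for illustration): during 20:10 all three events are running, so the count for that row should be 3.

# check one minute by hand: how many events overlap 20:10?
m.from <- as.POSIXct("2019-01-01 20:10:00", tz = "Europe/Amsterdam")
m.to   <- m.from + 59
dt[ start <= m.to & end >= m.from, .N ]
# expected: 3 (all three sample events overlap this minute)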
I could come up with two possible data.table solutions:
setkey( dt, start, end)
setkey( dt.minutes, from, to )
#method 1: non-equi join
ans1 <- dt.minutes[ dt.minutes, N := {
num = dt[ start <= i.to & end >= i.from ]
list( nrow(num) )
}, by = .EACHI ][]
#method 2: use foverlaps, summarise on `from` and then update-join
ans2 <- dt.minutes[, N:=0L][ foverlaps( dt, copy(dt.minutes) )[, .(N =.N), by = .(from)], N := i.N, on = .(from)]
Both methods work and provide the answer I need:
all.equal( ans1, ans2 )
# [1] TRUE
But when I look at the benchmarks, foverlaps() wins by a landslide:
# Unit: milliseconds
# expr min lq mean median uq max neval
# non_equi_join 2074.0594 2097.3363 2111.87762 2100.1306 2116.6965 2171.1653 5
# foverlaps 10.5716 10.8999 10.93622 10.9011 10.9479 11.3606 5
#
microbenchmark::microbenchmark(
non_equi_join = {
DT <- copy(dt)
DT2 <- copy(dt.minutes)
setkey( DT2, from, to )
DT2[ DT2, N := {
num = DT[ start <= i.to & end >= i.from ]
list( nrow(num) )
}, by = .EACHI ][]
},
foverlaps = {
DT <- copy(dt)
DT2 <- copy(dt.minutes)
setkey( DT, start, end)
setkey( DT2, from, to )
DT2[, N := 0L][ foverlaps( DT, copy(DT2) )[, .( N = .N ), by = .(from)], N := i.N, on = .(from)]
}, times = 5L
)
question(s)
In the spirit of better understanding data.table joins, I'm looking for the reason why my join (ans1) is so much slower (roughly 200x) than foverlaps() (ans2).
Is there a way to improve the performance of the join? Or is foverlaps() simply the optimised tool for this job?
Or are there even faster ways to achieve my goal?
Upvotes: 3
Views: 124
Reputation: 5059
First of all, I'm not sure if the default type of foverlaps is what you want.
Take for instance:
> foverlaps(dt.minutes, dt)[1368]
id start end from to
1: 1 2019-01-01 18:00:00 2019-01-01 21:00:00 2019-01-01 21:00:00 2019-01-01 21:00:59
That does behave like the documentation specifies, but it doesn't seem to be what you're after (id should be NA). You might need type = "within".
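To illustrate (a sketch, assuming the keys from your setup are still set), the same minute with type = "within" no longer matches event 1, because the minute runs until 21:00:59 while the event ends at 21:00:00:

# the 21:00 minute is not fully contained in event 1, so id should now be NA
foverlaps( dt.minutes, dt, type = "within" )[
  from == as.POSIXct("2019-01-01 21:00:00", tz = "Europe/Amsterdam") ]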
I'm not familiar with the internals of data.table, so a bit of the following is an educated guess.
Summarizing while joining with by = .EACHI is meant to optimize memory usage, not speed. If each resulting group in the join is pretty big, it might be worth materializing only parts of it at a time, but whatever code you pass to j is R code (usually; more on that below), i.e. not compiled code. The base join itself might be evaluated entirely in C, but with by = .EACHI, even if finding the matching rows is fast, evaluating j becomes essentially a loop in R across the groups, and the associated overhead adds up when there are a lot of small groups (like in your problem).
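Purely as a conceptual sketch (this is not the actual implementation), the by = .EACHI call above behaves roughly like an R-level loop over the rows of dt.minutes, evaluating the subset-and-count expression once per minute:

# rough conceptual equivalent of the non_equi_join version:
# j is evaluated once per row of dt.minutes, i.e. ~5761 times in R
N <- integer(nrow(dt.minutes))
for (k in seq_len(nrow(dt.minutes))) {
  N[k] <- nrow( dt[ start <= dt.minutes$to[k] & end >= dt.minutes$from[k] ] )
}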
I came up with two other alternatives (and modified the setup a bit); the benchmark on my system looks like this:
library(data.table)
dt <- data.table( id = 1:3,
start = c("2019-01-01 18:00:00", "2019-01-01 19:00:00", "2019-01-01 20:00:00"),
end = c("2019-01-01 21:00:00", "2019-01-01 20:15:00", "2019-01-01 20:30:00") )
dt[, c("start", "end") := lapply( .SD,
as.POSIXct,
format = "%Y-%m-%d %H:%M:%S",
tz = "Europe/Amsterdam"),
.SDcols = c("start", "end")]
dt.minutes <- data.table( from = seq( from = as.POSIXct( "2019-01-01 00:00:00",
format = "%Y-%m-%d %H:%M:%S",
tz = "Europe/Amsterdam"),
to = as.POSIXct( "2019-01-05 00:00:00",
format = "%Y-%m-%d %H:%M:%S",
tz = "Europe/Amsterdam"),
by = "1 min") )
dt.minutes[, to := from + 59 ]
library(microbenchmark)
microbenchmark::microbenchmark(
times = 5L,
non_equi_join = {
DT <- copy(dt)
DT2 <- copy(dt.minutes)
setkey( DT, start, end)
setkey( DT2, from, to )
DT2[ DT2, N := {
num = DT[ start <= i.to & end >= i.from ]
list( nrow(num) )
}, by = .EACHI ]
},
foverlaps = {
DT <- copy(dt)
DT2 <- copy(dt.minutes)
setkey( DT, start, end)
setkey( DT2, from, to )
DT2[, N := 0L][ foverlaps( DT, copy(DT2) )[, .( N = .N ), by = .(from)], N := i.N, on = .(from)]
},
nej = {
DT <- copy(dt)
DT2 <- copy(dt.minutes)
setkey( DT, start, end)
setkey( DT2, from, to )
DT2[, N := DT[.SD, .(id, start), on = .(start <= from, end >= to), allow.cartesian = TRUE
][, sum(!is.na(id)), by = "start"]$V1]
},
fo = {
DT <- copy(dt)
DT2 <- copy(dt.minutes)
setkey( DT, start, end)
setkey( DT2, from, to )
DT2[, N := foverlaps(DT2, DT, type="within", which=TRUE)[, sum(!is.na(yid)), by="xid"]$V1]
}
)
Unit: milliseconds
expr min lq mean median uq max neval
non_equi_join 2506.3448 2535.3132 2597.71440 2565.4727 2647.7538 2733.6875 5
foverlaps 13.8878 14.3945 14.66726 14.9400 15.0491 15.0649 5
nej 11.6391 12.0179 13.89408 13.2644 13.3602 19.1888 5
fo 11.4082 12.7889 13.77820 12.9216 13.0430 18.7293 5
*The results of my versions don't match yours because of what I mentioned at the beginning regarding type.
We can see that they are not much faster than what you had, but the interesting thing to note is the nej version. A non-equi join is also used there, but without by = .EACHI: the whole result of the join is materialized first and only aggregated afterwards, and in this case that is faster.
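As a rough illustration of that difference, the nej pipeline can be split into its two steps (the intermediate name joined is only for exposition):

# step 1: the non-equi join is materialized in full
# (one row per matching minute/event pair, plus an NA row per unmatched minute)
joined <- dt[ dt.minutes, .(id, start), on = .(start <= from, end >= to),
              allow.cartesian = TRUE ]
nrow(joined)
# step 2: only then is it aggregated into the per-minute counts, as in nej
joined[, sum(!is.na(id)), by = "start"]$V1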
Unfortunately I can't tell you exactly why (again, I'm not familiar with the internals), but the general rule of thumb is that by = .EACHI should only be used if you expect few big groups in the result, or if the code in j can be optimized by data.table.
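As a sketch of the second part of that rule (and this is essentially the core of the version in the EDIT below): a simple j such as .N is the kind of expression data.table can optimize internally instead of running arbitrary R code per group; setting verbose = TRUE prints its diagnostics if you want to see what it does with j:

# same by = .EACHI structure, but with a trivially simple j (.N);
# verbose = TRUE prints what data.table does internally
dt[ dt.minutes, .N, on = .(start <= from, end >= to),
    allow.cartesian = TRUE, by = .EACHI, verbose = TRUE ]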
BTW, in the fo version I use which = TRUE to avoid returning all the columns from the join and get only the matching indices instead. Since all we need is the number of matches per minute, the indices are enough. It didn't make a huge difference in this case.
*Do note that foverlaps' documentation mentions that usually the larger table should be provided in x.
EDIT: Frank's version does seem to be fastest:
dt.minutes[, n := dt[.SD, on=.(start <= from, end >= to), allow.cartesian=TRUE, .N, by=.EACHI]$N]
Upvotes: 3