Wimpel

Reputation: 27732

Get the number of events during each interval in the most efficient way

sample data

I've got a data.table with events (dt), and a data.table with all minutes over a certain period (dt.minutes).

dt <- data.table( id    = 1:3, 
                  start = c("2019-01-01 18:00:00", "2019-01-01 19:00:00", "2019-01-01 20:00:00"),
                  end   = c("2019-01-01 21:00:00", "2019-01-01 20:15:00", "2019-01-01 20:30:00") )
dt[, c("start", "end") := lapply( .SD, 
                                  as.POSIXct, 
                                  format = "%Y-%m-%d %H:%M:%S", 
                                  tz = "Europe/Amsterdam"),
   .SDcols = c("start", "end")]

dt.minutes <- data.table( from = seq( from = as.POSIXct( "2019-01-01 00:00:00", 
                                                         format = "%Y-%m-%d %H:%M:%S", 
                                                         tz = "Europe/Amsterdam"), 
                                      to   = as.POSIXct( "2019-01-05 00:00:00", 
                                                         format = "%Y-%m-%d %H:%M:%S", 
                                                         tz = "Europe/Amsterdam"), 
                                      by   = "1 min") )
dt.minutes[, to := from + 59 ][]

setkey( dt, start, end)
setkey( dt.minutes, from, to )

They look like this:

> dt
   id               start                 end
1:  1 2019-01-01 18:00:00 2019-01-01 21:00:00
2:  2 2019-01-01 19:00:00 2019-01-01 20:15:00
3:  3 2019-01-01 20:00:00 2019-01-01 20:30:00

> dt.minutes
                     from                  to
   1: 2019-01-01 00:00:00 2019-01-01 00:00:59
   2: 2019-01-01 00:01:00 2019-01-01 00:01:59
   3: 2019-01-01 00:02:00 2019-01-01 00:02:59
   4: 2019-01-01 00:03:00 2019-01-01 00:03:59
   5: 2019-01-01 00:04:00 2019-01-01 00:04:59
  ---                                        
5757: 2019-01-04 23:56:00 2019-01-04 23:56:59
5758: 2019-01-04 23:57:00 2019-01-04 23:57:59
5759: 2019-01-04 23:58:00 2019-01-04 23:58:59
5760: 2019-01-04 23:59:00 2019-01-04 23:59:59
5761: 2019-01-05 00:00:00 2019-01-05 00:00:59

problem

For each row (= one minute) in dt.minutes, I want to know how many events from dt were taking place at any point during that minute.
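For example, during the minute starting at 2019-01-01 20:10:00 all three events are active, so the expected count for that row is 3. A quick manual check of that single minute (for illustration only):

dt[ start <= as.POSIXct("2019-01-01 20:10:59", tz = "Europe/Amsterdam") &
    end   >= as.POSIXct("2019-01-01 20:10:00", tz = "Europe/Amsterdam"), .N ]
# [1] 3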

I could come up with two possible data.table solutions:

setkey( dt, start, end)
setkey( dt.minutes, from, to ) 

#method 1: non-equi join
ans1 <- dt.minutes[ dt.minutes, N := {
  num = dt[ start <= i.to & end >= i.from ]
  list( nrow(num) )
}, by = .EACHI ][]

#method 2: use foverlaps, summarise on `from` and then update-join
ans2 <- dt.minutes[, N:=0L][ foverlaps( dt, copy(dt.minutes) )[, .(N =.N), by = .(from)], N := i.N, on = .(from)]

Both methods work and provide the answer I need:

all.equal( ans1, ans2 )
# [1] TRUE

But when I look at the benchmarks, foverlaps() wins by a landslide:

# Unit: milliseconds
#          expr       min        lq       mean    median        uq       max neval
# non_equi_join 2074.0594 2097.3363 2111.87762 2100.1306 2116.6965 2171.1653     5
# foverlaps       10.5716   10.8999   10.93622   10.9011   10.9479   11.3606     5
# 

microbenchmark::microbenchmark(
  non_equi_join = {
    DT <- copy(dt)
    DT2 <- copy(dt.minutes)
    setkey( DT2, from, to )
    DT2[ DT2, N := {
      num = DT[ start <= i.to & end >= i.from ]
      list( nrow(num) )
    }, by = .EACHI ][]
  },
  foverlaps = {
    DT <- copy(dt)
    DT2 <- copy(dt.minutes)
    setkey( DT, start, end)
    setkey( DT2, from, to )
    DT2[, N := 0L][ foverlaps( DT, copy(DT2) )[, .( N = .N ), by = .(from)], N := i.N, on = .(from)]
  }, times = 5L
)

question(s)

In the spirit of better understanding data.table joins, I'm looking for the reason why my join (ans1) is about 200x slower than the foverlaps() approach (ans2).

Is there a way to increase the performance of the join? Or is foverlaps() simply the optimised tool for this job?

Or are there even faster ways to achieve my goal?

Upvotes: 3

Views: 124

Answers (1)

Alexis

Reputation: 5059

First of all, I'm not sure the default overlap type of foverlaps ("any") is what you want. Take for instance:

> foverlaps(dt.minutes, dt)[1368]
   id               start                 end                from                  to
1:  1 2019-01-01 18:00:00 2019-01-01 21:00:00 2019-01-01 21:00:00 2019-01-01 21:00:59

That does behave like the documentation specifies, but it doesn't seem to be what you're after (id should be NA). You might need type = "within".
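For comparison, here is a quick check of that same minute with type = "within" (a sketch using the tables defined above; since the 21:00 minute is not fully inside any event, id now comes back NA):

foverlaps( dt.minutes, dt, type = "within" )[
  from == as.POSIXct("2019-01-01 21:00:00", tz = "Europe/Amsterdam") ]
# roughly:
#    id start  end                from                  to
# 1: NA  <NA> <NA> 2019-01-01 21:00:00 2019-01-01 21:00:59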


I'm not familiar with the internals of data.table, so some of what follows is an educated guess.

Summarizing while joining with by = .EACHI is meant to optimize memory usage, not speed. If each group produced by the join is big, it can be worth materializing only one group at a time, but whatever code you pass to j is R code (usually, see comments below), i.e. not compiled code. The core join itself may be evaluated entirely in C, so finding the matching rows is fast, but with by = .EACHI the evaluation of j becomes essentially an R-level loop across the groups, and the associated overhead adds up when there are a lot of small groups (like in your problem).
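As a rough sketch of the difference in evaluation strategy (using the tables defined above; the second form mirrors the nej variant benchmarked below, so it uses the "within" matching discussed earlier):

# by = .EACHI: the R expression in j is evaluated once per minute, i.e. ~5761 times
dt.minutes[ dt.minutes,
            { dt[ start <= i.to & end >= i.from, .N ] },
            by = .EACHI ]

# single non-equi join: one C-level join materializes all matches,
# then one aggregation runs over the full result
dt[ dt.minutes, .(id, start), on = .(start <= from, end >= to), allow.cartesian = TRUE
    ][, sum(!is.na(id)), by = "start"]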

I came up with two more alternatives (and modified the setup a bit), and the benchmark on my system looks like this:

library(data.table)

dt <- data.table( id    = 1:3, 
                  start = c("2019-01-01 18:00:00", "2019-01-01 19:00:00", "2019-01-01 20:00:00"),
                  end   = c("2019-01-01 21:00:00", "2019-01-01 20:15:00", "2019-01-01 20:30:00") )
dt[, c("start", "end") := lapply( .SD, 
                                  as.POSIXct, 
                                  format = "%Y-%m-%d %H:%M:%S", 
                                  tz = "Europe/Amsterdam"),
   .SDcols = c("start", "end")]

dt.minutes <- data.table( from = seq( from = as.POSIXct( "2019-01-01 00:00:00", 
                                                         format = "%Y-%m-%d %H:%M:%S", 
                                                         tz = "Europe/Amsterdam"), 
                                      to   = as.POSIXct( "2019-01-05 00:00:00", 
                                                         format = "%Y-%m-%d %H:%M:%S", 
                                                         tz = "Europe/Amsterdam"), 
                                      by   = "1 min") )
dt.minutes[, to := from + 59 ]

library(microbenchmark)

microbenchmark::microbenchmark(
  times = 5L,
  non_equi_join = {
    DT <- copy(dt)
    DT2 <- copy(dt.minutes)
    setkey( DT, start, end)
    setkey( DT2, from, to )
    DT2[ DT2, N := {
      num = DT[ start <= i.to & end >= i.from ]
      list( nrow(num) )
    }, by = .EACHI ]
  },
  foverlaps = {
    DT <- copy(dt)
    DT2 <- copy(dt.minutes)
    setkey( DT, start, end)
    setkey( DT2, from, to )
    DT2[, N := 0L][ foverlaps( DT, copy(DT2) )[, .( N = .N ), by = .(from)], N := i.N, on = .(from)]
  },
  nej = {
    DT <- copy(dt)
    DT2 <- copy(dt.minutes)
    setkey( DT, start, end)
    setkey( DT2, from, to )
    DT2[, N := DT[.SD, .(id, start), on = .(start <= from, end >= to), allow.cartesian = TRUE
                  ][, sum(!is.na(id)), by = "start"]$V1]
  },
  fo = {
    DT <- copy(dt)
    DT2 <- copy(dt.minutes)
    setkey( DT, start, end)
    setkey( DT2, from, to )
    DT2[, N := foverlaps(DT2, DT, type="within", which=TRUE)[, sum(!is.na(yid)), by="xid"]$V1]
  }
)
Unit: milliseconds
          expr       min        lq       mean    median        uq       max neval
 non_equi_join 2506.3448 2535.3132 2597.71440 2565.4727 2647.7538 2733.6875     5
     foverlaps   13.8878   14.3945   14.66726   14.9400   15.0491   15.0649     5
           nej   11.6391   12.0179   13.89408   13.2644   13.3602   19.1888     5
            fo   11.4082   12.7889   13.77820   12.9216   13.0430   18.7293     5

*The results of my versions don't match yours because of what I mentioned in the beginning regarding type.

We can see that they are not much faster than what you had, but the interesting one is the nej version. It also uses a non-equi join, but without by = .EACHI: the whole result of the join is materialized first, and only afterwards is it aggregated, which turns out to be faster in this case. Unfortunately I can't tell you exactly why (again, I'm not familiar with the internals), but the general rule of thumb is that by = .EACHI should only be used if you expect few big groups in the result, or if the code in j can be optimized by data.table.

BTW, in the fo version I use which = TRUE so that the join returns only the matching indices instead of all the columns. Since only the number of matches matters, the indices are enough; it didn't make a huge difference in this case, though.
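For instance, a minimal look at what which = TRUE gives back (a sketch; it returns only the matching row indices):

head( foverlaps( dt.minutes, dt, type = "within", which = TRUE ) )
# a two-column data.table: xid = row number in dt.minutes,
# yid = row number of the matching event in dt (NA when there is no match)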

*Do note that foverlaps' documentation mentions that usually the larger table should be provided in x.

EDIT: Frank's version does seem to be fastest:

dt.minutes[, n := dt[.SD, on=.(start <= from, end >= to), allow.cartesian=TRUE, .N, by=.EACHI]$N]

Upvotes: 3
