using roll in data.table join, can i force strictly one match

Question

I have two tables to join based on date-time fields. To recreate the scenario let us take the commercial vs sale example. We want to know which sale is related to which commercial.

Sale can only be tagged to the last commercial and only if it occurred after the commercial.

Also, if a sale happened after multiple commercials, we can only tag the sale to the last commercial; prior commercials will have a null in the join.

I am unable to get this last part. If there is a sale after multiple commercials, then all such commercials are joined with that sale; which I don't want. In my example, the sale that happened at "2017-01-01 02:05:00" should join with the commercial that aired at "2017-01-01 02:00:00" and not the prior commercials.

output of the code

library(lubridate)
library(data.table)

ts <- seq(as.POSIXct("2017-01-01", tz = "UTC"),
          as.POSIXct("2017-01-02", tz = "UTC"),
          by = "30 min")

commercial <-
  data.table(
    c_row_number = 1:10,
    c_time       = ts[1:10],
    c_time_roll  = ts[1:10]
  )

sale <-
  data.table(
    s_row_number = 1:4,
    s_time       = ts[5:8] + minutes(5),
    s_time_roll  = ts[5:8] + minutes(5)
  )

setkey(commercial, c_time_roll)
setkey(sale, s_time_roll)

tbl_joined <- sale[commercial, roll = -Inf] # , mult = 'last']

Any idea how can we get NAs where c_row_number is 1, 2, 3, and 4. Thank you.

Alexis · Accepted Answer

If your commercial times are sorted, or you can sort them, then you can use a non-equi join with a helper column with shifted times:

library(lubridate)
library(data.table)

ts <- seq(as.POSIXct("2017-01-01", tz = "UTC"),
          as.POSIXct("2017-01-02", tz = "UTC"),
          by = "30 min")

commercial <-
  data.table(
    c_row_number = 1:10,
    c_time       = ts[1:10],
    c_next_time  = shift(ts[1:10], type = "lead", fill = max(ts))
  )

sale <-
  data.table(
    s_row_number = 1:4,
    s_time       = ts[5:8] + minutes(5),
    s_time_join  = ts[5:8] + minutes(5)
  )

tbl_joined <- sale[commercial, on = .(s_time_join >= c_time, s_time_join < c_next_time)]

And if you want to use this idiom:

commercial[, s_time := sale[.SD,
                            .(s_time),
                            on = .(s_time_join >= c_time, s_time_join < c_next_time)]]
print(commercial)
    c_row_number              c_time         c_next_time              s_time
 1:            1 2017-01-01 00:00:00 2017-01-01 00:30:00                
 2:            2 2017-01-01 00:30:00 2017-01-01 01:00:00                
 3:            3 2017-01-01 01:00:00 2017-01-01 01:30:00                
 4:            4 2017-01-01 01:30:00 2017-01-01 02:00:00                
 5:            5 2017-01-01 02:00:00 2017-01-01 02:30:00 2017-01-01 02:05:00
 6:            6 2017-01-01 02:30:00 2017-01-01 03:00:00 2017-01-01 02:35:00
 7:            7 2017-01-01 03:00:00 2017-01-01 03:30:00 2017-01-01 03:05:00
 8:            8 2017-01-01 03:30:00 2017-01-01 04:00:00 2017-01-01 03:35:00
 9:            9 2017-01-01 04:00:00 2017-01-01 04:30:00                
10:           10 2017-01-01 04:30:00 2017-01-02 00:00:00

using roll in data.table join, can i force strictly one match

Answers (2)

Related Questions