Imtiaz
Imtiaz

Reputation: 137

using roll in data.table join, can i force strictly one match

I have two tables to join based on date-time fields. To recreate the scenario let us take the commercial vs sale example. We want to know which sale is related to which commercial.

Sale can only be tagged to the last commercial and only if it occurred after the commercial.

Also, if a sale happened after multiple commercials, we can only tag the sale to the last commercial; prior commercials will have a null in the join.

I am unable to get this last part. If there is a sale after multiple commercials, then all such commercials are joined with that sale; which I don't want. In my example, the sale that happened at "2017-01-01 02:05:00" should join with the commercial that aired at "2017-01-01 02:00:00" and not the prior commercials.

output of the code

library(lubridate)
library(data.table)

ts <- seq(as.POSIXct("2017-01-01", tz = "UTC"),
          as.POSIXct("2017-01-02", tz = "UTC"),
          by = "30 min")

commercial <-
  data.table(
    c_row_number = 1:10,
    c_time       = ts[1:10],
    c_time_roll  = ts[1:10]
  )

sale <-
  data.table(
    s_row_number = 1:4,
    s_time       = ts[5:8] + minutes(5),
    s_time_roll  = ts[5:8] + minutes(5)
  )

setkey(commercial, c_time_roll)
setkey(sale, s_time_roll)

tbl_joined <- sale[commercial, roll = -Inf] # , mult = 'last']

Any idea how can we get NAs where c_row_number is 1, 2, 3, and 4. Thank you.

Upvotes: 2

Views: 370

Answers (2)

Alexis
Alexis

Reputation: 5069

If your commercial times are sorted, or you can sort them, then you can use a non-equi join with a helper column with shifted times:

library(lubridate)
library(data.table)

ts <- seq(as.POSIXct("2017-01-01", tz = "UTC"),
          as.POSIXct("2017-01-02", tz = "UTC"),
          by = "30 min")

commercial <-
  data.table(
    c_row_number = 1:10,
    c_time       = ts[1:10],
    c_next_time  = shift(ts[1:10], type = "lead", fill = max(ts))
  )

sale <-
  data.table(
    s_row_number = 1:4,
    s_time       = ts[5:8] + minutes(5),
    s_time_join  = ts[5:8] + minutes(5)
  )

tbl_joined <- sale[commercial, on = .(s_time_join >= c_time, s_time_join < c_next_time)]

And if you want to use this idiom:

commercial[, s_time := sale[.SD,
                            .(s_time),
                            on = .(s_time_join >= c_time, s_time_join < c_next_time)]]
print(commercial)
    c_row_number              c_time         c_next_time              s_time
 1:            1 2017-01-01 00:00:00 2017-01-01 00:30:00                <NA>
 2:            2 2017-01-01 00:30:00 2017-01-01 01:00:00                <NA>
 3:            3 2017-01-01 01:00:00 2017-01-01 01:30:00                <NA>
 4:            4 2017-01-01 01:30:00 2017-01-01 02:00:00                <NA>
 5:            5 2017-01-01 02:00:00 2017-01-01 02:30:00 2017-01-01 02:05:00
 6:            6 2017-01-01 02:30:00 2017-01-01 03:00:00 2017-01-01 02:35:00
 7:            7 2017-01-01 03:00:00 2017-01-01 03:30:00 2017-01-01 03:05:00
 8:            8 2017-01-01 03:30:00 2017-01-01 04:00:00 2017-01-01 03:35:00
 9:            9 2017-01-01 04:00:00 2017-01-01 04:30:00                <NA>
10:           10 2017-01-01 04:30:00 2017-01-02 00:00:00                <NA>

Upvotes: 1

MichaelChirico
MichaelChirico

Reputation: 34763

There's no way to do this directly -- x[i] uses i to look up rows in x. mult is used for the reverse -- when multiple rows in x match to a single row in i. Here, multiple rows in i match to a single row in x.

Your best bet then, is to operate post-join on the resulting table. For example, to drop those rows, you could use unique:

unique(sale[commercial, roll = -Inf], by = 's_row_number', fromLast = TRUE)
#    s_row_number              s_time         s_time_roll c_row_number
# 1:            1 2017-01-01 02:05:00 2017-01-01 02:00:00            5
# 2:            2 2017-01-01 02:35:00 2017-01-01 02:30:00            6
# 3:            3 2017-01-01 03:05:00 2017-01-01 03:00:00            7
# 4:            4 2017-01-01 03:35:00 2017-01-01 03:30:00            8
# 5:           NA                <NA> 2017-01-01 04:30:00           10
#                 c_time
# 1: 2017-01-01 02:00:00
# 2: 2017-01-01 02:30:00
# 3: 2017-01-01 03:00:00
# 4: 2017-01-01 03:30:00
# 5: 2017-01-01 04:30:00

I suspect you're creating {s,c}_row_number just for this task; to do so without those columns, you could do:

sale[commercial, roll = -Inf][order(-c_time)][rowid(s_time) == 1L]

We sort in reverse by c_time to make sure rowid gets the most recent value.

Note that in both cases, one of the is.na(s_time) rows has been dropped.

Hopefully this gets you going in the right direction.

Upvotes: 1

Related Questions