Reputation: 137
I have two tables to join based on date-time fields. To recreate the scenario let us take the commercial vs sale example. We want to know which sale is related to which commercial.
Sale can only be tagged to the last commercial and only if it occurred after the commercial.
Also, if a sale happened after multiple commercials, we can only tag the sale to the last commercial; prior commercials will have a null in the join.
I am unable to get this last part. If there is a sale after multiple commercials, then all such commercials are joined with that sale; which I don't want. In my example, the sale that happened at "2017-01-01 02:05:00" should join with the commercial that aired at "2017-01-01 02:00:00" and not the prior commercials.
library(lubridate)
library(data.table)
ts <- seq(as.POSIXct("2017-01-01", tz = "UTC"),
as.POSIXct("2017-01-02", tz = "UTC"),
by = "30 min")
commercial <-
data.table(
c_row_number = 1:10,
c_time = ts[1:10],
c_time_roll = ts[1:10]
)
sale <-
data.table(
s_row_number = 1:4,
s_time = ts[5:8] + minutes(5),
s_time_roll = ts[5:8] + minutes(5)
)
setkey(commercial, c_time_roll)
setkey(sale, s_time_roll)
tbl_joined <- sale[commercial, roll = -Inf] # , mult = 'last']
Any idea how can we get NAs where c_row_number is 1, 2, 3, and 4. Thank you.
Upvotes: 2
Views: 370
Reputation: 5069
If your commercial times are sorted, or you can sort them, then you can use a non-equi join with a helper column with shifted times:
library(lubridate)
library(data.table)
ts <- seq(as.POSIXct("2017-01-01", tz = "UTC"),
as.POSIXct("2017-01-02", tz = "UTC"),
by = "30 min")
commercial <-
data.table(
c_row_number = 1:10,
c_time = ts[1:10],
c_next_time = shift(ts[1:10], type = "lead", fill = max(ts))
)
sale <-
data.table(
s_row_number = 1:4,
s_time = ts[5:8] + minutes(5),
s_time_join = ts[5:8] + minutes(5)
)
tbl_joined <- sale[commercial, on = .(s_time_join >= c_time, s_time_join < c_next_time)]
And if you want to use this idiom:
commercial[, s_time := sale[.SD,
.(s_time),
on = .(s_time_join >= c_time, s_time_join < c_next_time)]]
print(commercial)
c_row_number c_time c_next_time s_time
1: 1 2017-01-01 00:00:00 2017-01-01 00:30:00 <NA>
2: 2 2017-01-01 00:30:00 2017-01-01 01:00:00 <NA>
3: 3 2017-01-01 01:00:00 2017-01-01 01:30:00 <NA>
4: 4 2017-01-01 01:30:00 2017-01-01 02:00:00 <NA>
5: 5 2017-01-01 02:00:00 2017-01-01 02:30:00 2017-01-01 02:05:00
6: 6 2017-01-01 02:30:00 2017-01-01 03:00:00 2017-01-01 02:35:00
7: 7 2017-01-01 03:00:00 2017-01-01 03:30:00 2017-01-01 03:05:00
8: 8 2017-01-01 03:30:00 2017-01-01 04:00:00 2017-01-01 03:35:00
9: 9 2017-01-01 04:00:00 2017-01-01 04:30:00 <NA>
10: 10 2017-01-01 04:30:00 2017-01-02 00:00:00 <NA>
Upvotes: 1
Reputation: 34763
There's no way to do this directly -- x[i]
uses i
to look up rows in x
. mult
is used for the reverse -- when multiple rows in x
match to a single row in i
. Here, multiple rows in i
match to a single row in x
.
Your best bet then, is to operate post-join on the resulting table. For example, to drop those rows, you could use unique
:
unique(sale[commercial, roll = -Inf], by = 's_row_number', fromLast = TRUE)
# s_row_number s_time s_time_roll c_row_number
# 1: 1 2017-01-01 02:05:00 2017-01-01 02:00:00 5
# 2: 2 2017-01-01 02:35:00 2017-01-01 02:30:00 6
# 3: 3 2017-01-01 03:05:00 2017-01-01 03:00:00 7
# 4: 4 2017-01-01 03:35:00 2017-01-01 03:30:00 8
# 5: NA <NA> 2017-01-01 04:30:00 10
# c_time
# 1: 2017-01-01 02:00:00
# 2: 2017-01-01 02:30:00
# 3: 2017-01-01 03:00:00
# 4: 2017-01-01 03:30:00
# 5: 2017-01-01 04:30:00
I suspect you're creating {s,c}_row_number
just for this task; to do so without those columns, you could do:
sale[commercial, roll = -Inf][order(-c_time)][rowid(s_time) == 1L]
We sort in reverse by c_time
to make sure rowid
gets the most recent value.
Note that in both cases, one of the is.na(s_time)
rows has been dropped.
Hopefully this gets you going in the right direction.
Upvotes: 1