Reputation: 49
I am trying to join two data tables with a data.table non-equi join. However, some of the dates in table 1, do not fall within a date range in table 2. What I want with these "nomatch", is to join them to the nearest date range. Lets says there are two ranges; 2002-01-01 to 2004-01-01 and another range; 2010-01-01 to 2012-01-01. The date 2009-01-01 will be out of range, however I want this date to be joined to the 2011-2012 range, as the year 2009 is closer to 2011, than the year 2004. I hope my question makes sense.
Example data
library(data.table)
# data.table with single dates
id <- c("xxx","xxx","xxx","yyy","yyy","yyy","zzz","zzz","zzz")
date <- as.Date(c("2019-02-01","2020-05-01","2021-02-01","2020-01-01","2020-07-01","2020-08-01","2021-03-01","2022-10-01","2023-11-16"))
single_date_dt <- data.table(id,date)
# data.table with range dates
id <- c("xxx","xxx","yyy","zzz","zzz")
weight_kg <- c(10,13,20,45,65)
start_date <- as.Date(c("2019-01-01","2021-01-01","2020-01-01","2021-01-01","2023-01-01"))
end_date <- as.Date(c("2019-12-31","2021-12-31","2021-01-01","2021-12-31","2023-12-31"))
range_date_dt <- data.table(id,weight_kg,start_date,end_date)
dates_joined <- range_date_dt[single_date_dt, on = .(id, start_date <= date, end_date >= date), nomatch = NA]
There are two NA's in the "dates_joined" data.table. The first NA should be "10" since 2020-05-01 is closer to 2019-12-31 than 2021-01-01, and the second NA should be 65, since 2022-10-01 is closer to 2023-01-01 than 2021-12-31.
The real dataset consist of several million observations with around 5000 "out of range" dates.
Thanks
Upvotes: 0
Views: 71
Reputation: 17011
I think we can safely assume the ranges won't overlap for the same id
, so we can perform the rolling join using a melt
ed range_date_dt
:
melt(range_date_dt, c("id", "weight_kg"), value.name = "date")[
,variable := NULL
][single_date_dt, on = .(id, date), roll = "nearest"]
#> id weight_kg date
#> 1: xxx 10 2019-02-01
#> 2: xxx 10 2020-05-01
#> 3: xxx 13 2021-02-01
#> 4: yyy 20 2020-01-01
#> 5: yyy 20 2020-07-01
#> 6: yyy 20 2020-08-01
#> 7: zzz 45 2021-03-01
#> 8: zzz 65 2022-10-01
#> 9: zzz 65 2023-11-16
Upvotes: 2
Reputation: 27762
Here is a solution (using data.table
) where you first create a long version of your ranges, with a single date per row. Using this table, you can easily perform a rolling join to the nearest date.
# create a long data.table with days and theit id/weight_kg
range.long <- range_date_dt[ , list(id, weight_kg, date = seq(start_date, end_date, by = "day")), by = 1:nrow(range_date_dt)]
# perform rolling join (by reference) to nearest date
single_date_dt[, weight_kg := range.long[.SD, weight_kg, on = .(id, date), roll = "nearest"]][]
# id date weight_kg
# 1: xxx 2019-02-01 10
# 2: xxx 2020-05-01 10
# 3: xxx 2021-02-01 13
# 4: yyy 2020-01-01 20
# 5: yyy 2020-07-01 20
# 6: yyy 2020-08-01 20
# 7: zzz 2021-03-01 45
# 8: zzz 2022-10-01 65
# 9: zzz 2023-11-16 65
Upvotes: 2