KKhosra
KKhosra

Reputation: 173

Merge data based on nearest date R

How do I jeft.join 2 data frames based on the nearest date? I currently have the script written so that it joins by the exact date, but I would prefer to do it by nearest date in case there is not an exact match.

This is what I currently have:

MASTER_DATABASE <- left_join(ptnamesMID, CTDB, by = c("LAST_NAME", "FIRST_NAME", "Measure_date" = "VISIT_DATE"))

Upvotes: 2

Views: 4584

Answers (2)

Uwe
Uwe

Reputation: 42544

The rolling joins in the data.table have a parameter roll = "nearest" which does probably what the OP expects.

Unfortunately, the OP has failed to provide sample data so I had to make up my own sample data.

Create sample datasets

set.seed(123L)
dates <- seq(as.Date("2017-01-01"), as.Date("2017-12-31"), by = "1 day")
ptnamesMID <- data.frame(ID = 1:10, Measure_date = sample(dates, 10L))
CTDB <- data.frame(ID = sample.int(10, 30L, TRUE), VISIT_DATE = sample(dates, 30L, TRUE))

Here, ID is used in place of LAST_NAME and FIRST_NAME for simplification. ptnamesMID consists of 10 rows while CTDB has 30 rows of randomly sampled visit dates.

Rolling join

library(data.table)
# coerce to data.table and append join columns to preserve the original columns 
setDT(CTDB)[, join_date := VISIT_DATE]
setDT(ptnamesMID)[, join_date := Measure_date]
# rolling join
CTDB[ptnamesMID, on = .(ID, join_date), roll = "nearest"]
    ID VISIT_DATE  join_date Measure_date
 1:  1 2017-06-20 2017-04-15   2017-04-15
 2:  2 2017-05-17 2017-10-14   2017-10-14
 3:  3 2017-06-10 2017-05-29   2017-05-29
 4:  4 2017-10-17 2017-11-16   2017-11-16
 5:  5 2017-06-13 2017-12-06   2017-12-06
 6:  6 2017-02-16 2017-01-17   2017-01-17
 7:  7 2017-07-24 2017-07-09   2017-07-09
 8:  8 2017-10-23 2017-12-28   2017-12-28
 9:  9 2017-02-20 2017-07-16   2017-07-16
10: 10 2017-08-31 2017-06-12   2017-06-12

In data.table syntax, CTDB[ptnamesMID, ...] is equivalent to a left join of ptnamesMID with CTDB, i.e., all rows of of ptnamesMID are kept in the result set.

Upvotes: 6

Ryan John
Ryan John

Reputation: 1430

Without an example it's hard to help your use case. I'd try out a package by David Robinson:

https://cran.r-project.org/web/packages/fuzzyjoin/fuzzyjoin.pdf

Here is the example for interval_join:

if (requireNamespace("IRanges", quietly = TRUE)) {
x1 <- data.frame(id1 = 1:3, start = c(1, 5, 10), end = c(3, 7, 15))
x2 <- data.frame(id2 = 1:3, start = c(2, 4, 16), end = c(4, 8, 20))
interval_inner_join(x1, x2)
# Allow them to be separated by a gap with a maximum:
interval_inner_join(x1, x2, maxgap = 1) # let 1 join with 2
interval_inner_join(x1, x2, maxgap = 20) # everything joins each other
# Require that they overlap by more than a particular amount
interval_inner_join(x1, x2, minoverlap = 3)
# other types of joins:
interval_full_join(x1, x2)
interval_left_join(x1, x2)
interval_right_join(x1, x2)
interval_semi_join(x1, x2)
interval_anti_join(x1, x2)
}

Upvotes: 3

Related Questions