P Meddyyy
P Meddyyy

Reputation: 15

R - Fuzzy Inner Join on two fields, matching to a date range

I'm fairly new to R, and have been sifting through other questions all morning trying to figure this out, but can't find anything related enough or my knowledge of R is not good enough to understand some of the suggested solutions to my problem.

I have two data frames, table A with a list of non-unique identifiers and a date, and table B with the same identifier field, and a start and end date outlining a 3 month date range. In my real data, I have 1.7m records in table A, and 1.6k records in table A (as well many other fields that i'll use for the final analysis). I am expecting the vast majority of records in table A to be unnecessary.

What I want to achieve is to join the two tables together, joining on the identifier, and then only joining if the date in table A falls inside the date range in table B. I want this as an inner join so I lose the unnecessary data.

Example tables:

a <- data.frame(numbera=c('1','2','3','1'),date1=as.Date(c('10/04/2021','21/06/2021','02/10/2021','17/02/2022'), format="%d/%m/%Y"),date2=as.Date(c('10/07/2021','21/09/2021','02/01/2022','17/05/2022'), format="%d/%m/%Y"))

b <- data.frame(numberb=c('1','2','2','3','1'),datex=as.Date(c('16/05/2021','01/08/2021','03/08/2021','02/09/2021','17/03/2022'), format="%d/%m/%Y"))

Expected result (we lose person 3 because the date falls outside the range, person 2 has two records because they had two entries in table b with corresponding dates):

c <- data.frame(numberb=c('1','2','2','1'),datex=as.Date(c('16/05/2021','01/08/2021','03/08/2021','17/03/2022'), format="%d/%m/%Y"),numbera=c('1','2','2','1'),date1=as.Date(c('10/04/2021','21/06/2021','21/06/2021','17/02/2022'), format="%d/%m/%Y"),date2=as.Date(c('10/07/2021','21/09/2021','21/09/2021','17/05/2022'), format="%d/%m/%Y"))

I've been basing my solution on this web page https://exploratory.io/note/exploratory/How-to-join-two-data-frames-with-date-ranges-moq8hEQ6, however my example adds in the requirement to also join the identifier.

My solution that logically makes sense to me based on my (limited) ability in R:

Joined <- fuzzy_inner_join(b, a, by = c("numberb"="numbera", "datex"="date1", "datex"="date2"),
                           match_fun = list("=", ">=", "<="))

However I get this error message:

Error in which(m) : argument to 'which' is not logical

Thank you in advance for any help here :)

Upvotes: 0

Views: 1311

Answers (3)

Evan Cutler Anway
Evan Cutler Anway

Reputation: 71

data.table supports non-equi joins that are the fastest I've found. The syntax is a bit different from the tidyverse but I think it's worth it for these more complicated joins.

a <- data.frame(numbera=c('1','2','3','1'),date1=as.Date(c('10/04/2021','21/06/2021','02/10/2021','17/02/2022'), format="%d/%m/%Y"),date2=as.Date(c('10/07/2021','21/09/2021','02/01/2022','17/05/2022'), format="%d/%m/%Y"))

b <- data.frame(numberb=c('1','2','2','3','1'),datex=as.Date(c('16/05/2021','01/08/2021','03/08/2021','02/09/2021','17/03/2022'), format="%d/%m/%Y"))

c <- data.frame(numberb=c('1','2','2','1'),datex=as.Date(c('16/05/2021','01/08/2021','03/08/2021','17/03/2022'), format="%d/%m/%Y"),numbera=c('1','2','2','1'),date1=as.Date(c('10/04/2021','21/06/2021','21/06/2021','17/02/2022'), format="%d/%m/%Y"),date2=as.Date(c('10/07/2021','21/09/2021','21/09/2021','17/05/2022'), format="%d/%m/%Y"))

library(data.table)

dt_a <- setDT(a) # change the data frame to a data table

dt_b <- setDT(b) # change the data frame to a data table

Joined <- dt_b[ # take table b
  dt_a, # join table a
  .(numberb, datex = x.datex, numbera, date1, date2), # selecting these columns
  on = .(numberb == numbera, datex>=date1, datex<=date2), # joining on these columns
  nomatch = NULL # remove non-matches for an inner join
] |> 
  setDF() # change it back to a data frame for comparison

identical(c, Joined) # TRUE

Upvotes: 2

akshaymoorthy
akshaymoorthy

Reputation: 346

Alternate solution using data.table:

library(data.table)
dt <- merge(a,b, by.x = "numbera", by.y = "numberb")
setDT(dt)
dt[date1 <= datex & date2 >= datex]

   numbera      date1      date2      datex
1:       1 2021-04-10 2021-07-10 2021-05-16
2:       1 2022-02-17 2022-05-17 2022-03-17
3:       2 2021-06-21 2021-09-21 2021-08-01
4:       2 2021-06-21 2021-09-21 2021-08-03

Upvotes: 1

Carl
Carl

Reputation: 7540

With backticks for the match_fun and == for the number:

library(tidyverse)
library(fuzzyjoin)

a <- data.frame(
  numbera = c("1", "2", "3", "1"),
  date1 = as.Date(c("10/04/2021", "21/06/2021", "02/10/2021", "17/02/2022"),
    format = "%d/%m/%Y"
  ),
  date2 = as.Date(c("10/07/2021", "21/09/2021", "02/01/2022", "17/05/2022"),
    format = "%d/%m/%Y"
  )
)

b <- data.frame(
  numberb = c("1", "2", "2", "3", "1"),
  datex = as.Date(c("16/05/2021", "01/08/2021", "03/08/2021", "02/09/2021", "17/03/2022"),
    format = "%d/%m/%Y"
  )
)

fuzzy_inner_join(b, a,
  by = c("numberb" = "numbera", "datex" = "date1", "datex" = "date2"),
  match_fun = list(`==`, `>=`, `<=`)
)
#>   numberb      datex numbera      date1      date2
#> 1       1 2021-05-16       1 2021-04-10 2021-07-10
#> 2       2 2021-08-01       2 2021-06-21 2021-09-21
#> 3       2 2021-08-03       2 2021-06-21 2021-09-21
#> 4       1 2022-03-17       1 2022-02-17 2022-05-17

Created on 2022-07-08 by the reprex package (v2.0.1)

Upvotes: 2

Related Questions