Luke
Luke

Reputation: 11

R function to join two tables if date in table 1 is earlier than date in table 2

this question is about Tibbles in tidyverse package of R. I have created the below example to represent my data .

Tibble 'ab' is a list of people (column a) and a date that something specific happened (column b) e.g. received a vaccine.
Tibble 'cd' is a list of the same people (column c) but includes all event dates recorded for that person (column d).

I would like a table similar to Tibble 'cd' but only where the date (column d) is earlier than the date for each person in Tibble 'ab' (column b)(e.g. date receiving a vaccine). In other words, excluding all events that happen after the vaccine date in Tibble 'ab', for each person.

a <- 1001:1007
b <- c ("20/03/1998", "04/05/2012", "06/07/1989", "14/08/1990", "30/01/1978", "09/09/2011","23/11/2005")

ab <- tibble (a,b) 
ab_date <- ab %>% mutate (b = as.Date (b, format = "%d/%m/%Y")) 

c <- c( 1001, 1001, 1001, 1001, 1002, 1002, 1002)
d <- c( "12/02/1996", "14/05/1997", "20/03/1998", "23/03/1998", "10/10/1992", "04/05/2012", "13/11/2020")

cd <- tibble (c,d)
cd_date <- cd  %>% mutate (d = as.Date (d, format = "%d/%m/%Y"))

For this example it would look like this:

e <- c(1001, 1001, 1002)
f <- c( "12/02/1996","14/05/1997", "10/10/1992") 

ef <- tibble (e,f)
ef_date <- ef %>% mutate (f = as.Date (f, format = "%d/%m/%Y"))

I have tried a few things with inner_join and fuzzy_join but am not really sure how to apply these for what I am trying to do!

Any help would be greatly appreciated!

Thanks

Luke

Upvotes: 1

Views: 405

Answers (1)

Ben
Ben

Reputation: 30474

With fuzzyjoin you can try the following. Upon joining, you would want columns a and c to be equal, and keep dates where b comes after d. Using an inner join, you will only keep matching rows in both data frames.

library(tidyverse)
library(fuzzyjoin)

ab_date %>%
  fuzzy_inner_join(cd_date,
                   by = c("a" = "c", "b" = "d"),
                   match_fun = c(`==`, `>`)) %>%
  select(c, d)

Output

      c d         
  <dbl> <date>    
1  1001 1996-02-12
2  1001 1997-05-14
3  1002 1992-10-10

Upvotes: 1

Related Questions