Marco Mello
Marco Mello

Reputation: 175

Merge dataframes based on interval condition

I have a dataframe like this

id start        end
1  20/06/88     24/07/89
1  27/07/89     13/04/93
1  14/04/93     6/09/95
2  3/01/92      11/02/94
2  30/03/94     16/04/96
2  17/04/96     18/08/97

that I would like to merge with this other dataframe

id date
1  26/08/88   
2  10/05/96    

The resulting merged dataframe should look like this

id start        end         date
1  20/06/88     24/07/89    26/06/88
1  27/07/89     13/04/93    NA
1  14/04/93     6/09/95     NA
2  3/01/92      11/02/94    NA
2  30/03/94     16/04/96    NA
2  17/04/96     18/08/97    10/05/96

In practice I want to merge the two dataframes based on id and on the fact that date must lie within the interval spanned by the start and end vars of the first dataframe.

Do you have any suggestion on how to do this? I tried to use the fuzzyjoin package, but I have some memory issue..

Many thanks to everyone

Upvotes: 0

Views: 230

Answers (2)

Wietze314
Wietze314

Reputation: 6020

You can use sqldf for complex joins:


require(sqldf)

sqldf("SELECT df1.*,df2.date,df2.id as id2
      FROM df1
      LEFT JOIN df2 
      ON df1.id = df2.id AND
      df1.start < df2.date AND
      df1.end > df2.date")

Upvotes: 1

Maurits Evers
Maurits Evers

Reputation: 50728

Might be a dupe, will remove when I found a good target. In the meantime, we could use fuzzyjoin

library(tidyverse)
library(fuzzyjoin)
df1 %>%
    mutate_at(2:3, as.Date, "%d/%m/%y") %>%
    fuzzy_left_join(
        df2 %>% mutate(date = as.Date(date, "%d/%m/%y")),
        by = c("id" = "id", "start" = "date", "end" = "date"),
        match_fun = list(`==`, `<`, `>`))
#  id.x      start        end id.y       date
#1    1 1988-06-20 1989-07-24    1 1988-08-26
#2    1 1989-07-27 1993-04-13   NA       <NA>
#3    1 1993-04-14 1995-09-06   NA       <NA>
#4    2 1992-01-03 1994-02-11   NA       <NA>
#5    2 1994-03-30 1996-04-16   NA       <NA>
#6    2 1996-04-17 1997-08-18    2 1996-05-10

All that remains is tidying up the id columns.


Sample data

df1 <- read.table(text = "
id start        end
1  20/06/88     24/07/89
1  27/07/89     13/04/93
1  14/04/93     6/09/95
2  3/01/92      11/02/94
2  30/03/94     16/04/96
2  17/04/96     18/08/97", header = T)

df2 <- read.table(text = "
id date
1  26/08/88
2  10/05/96   ", header = T)

Upvotes: 2

Related Questions