join data frame based on date ranges not exact values

Question

I have two dataframes, df1 and df2. I want to join these dataframes on id and date so that each observation in df2 is mapped to df1 based on date range. There are no exact matches on id and date, so I have to infer the correct match based on where the date in df2 falls relative to the date in df1 for each id. Let's get generate the data for a toy example.

library(tidyverse)

df1 <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                  date = lubridate::date(c("2020-01-01",
                                           "2020-03-01",
                                           "2020-05-01",
                                         
                                           "2020-02-01",
                                           "2020-04-01",
                                           "2020-06-01")),
                  thing1 = c("a", "b", "c", 
                             "a", "b", "c"))

df2 <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                  date = lubridate::date(c("2020-01-05",
                                           "2020-02-20",
                                           "2020-03-05",
                                           "2020-03-10",
                                           "2020-05-05",
                                           
                                           "2020-02-05",
                                           "2020-02-10",
                                           "2020-04-05",
                                           "2020-04-10",
                                           "2020-06-05")),
                  thing2 = c("v", "w", "x", "y", "z", 
                             "v", "w", "x", "y", "z"))

df1
#  id       date thing1
#1  1 2020-01-01      a
#2  1 2020-03-01      b
#3  1 2020-05-01      c
#4  2 2020-02-01      a
#5  2 2020-04-01      b
#6  2 2020-06-01      c

df2
#   id       date thing2
#1   1 2020-01-05      v
#2   1 2020-02-20      w
#3   1 2020-03-05      x
#4   1 2020-03-10      y
#5   1 2020-05-05      z
#6   2 2020-02-05      v
#7   2 2020-02-10      w
#8   2 2020-04-05      x
#9   2 2020-04-10      y
#10  2 2020-06-05      z

Here's what I want:

want <- structure(list(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), date = structure(c(18266, 
18312, 18326, 18331, 18387, 18297, 18302, 18357, 18362, 18418
), class = "Date"), thing2 = structure(c(1L, 2L, 3L, 4L, 5L, 
1L, 2L, 3L, 4L, 5L), .Label = c("v", "w", "x", "y", "z"), class = "factor"), 
    thing1 = structure(c(1L, 1L, 2L, 2L, 3L, 1L, 1L, 2L, 2L, 
    3L), .Label = c("a", "b", "c"), class = "factor")), class = "data.frame", row.names = c(NA, 
-10L))
want
#   id       date thing2 thing1
#1   1 2020-01-05      v      a
#2   1 2020-02-20      w      a
#3   1 2020-03-05      x      b
#4   1 2020-03-10      y      b
#5   1 2020-05-05      z      c
#6   2 2020-02-05      v      a
#7   2 2020-02-10      w      a
#8   2 2020-04-05      x      b
#9   2 2020-04-10      y      b
#10  2 2020-06-05      z      c

In this example, id==1 on 2020-02-20 in df2 maps to thing1==a because "a" is the reigning value for id==1 between the first value "2020-01-01" and the second value "2020-03-01".

Problem 1: I can get almost there by using complete() on df1 to fill in all of the missing dates by id and then join with df2. One problem is that there is no exact match for dates that fall out of the min/max range for each id. For instance, id==1 on "2020-05-05" should map to a value of "c" because it falls after "2020-05-01" in df1, but there's no match in the completed/joined dataframe.

df1_ <- df1 %>% 
  group_by(id) %>% 
  complete(id, date = seq.Date(min(date), max(date), by="day")) %>% 
  fill(thing1)

df1_df2 <- df2 %>%
  left_join(df1_, by=c("id", "date"))

df1_df2 
#   id       date thing2 thing1
#1   1 2020-01-05      v      a
#2   1 2020-02-20      w      a
#3   1 2020-03-05      x      b
#4   1 2020-03-10      y      b
#5   1 2020-05-05      z   
#6   2 2020-02-05      v      a
#7   2 2020-02-10      w      a
#8   2 2020-04-05      x      b
#9   2 2020-04-10      y      b
#10  2 2020-06-05      z

Problem 2: The bigger problem for me is that this approach does not scale. My real df1 has several date observations each for >500k ids. Creating a complete dataframe by person between big data ranges creates millions of records quickly and takes too long.

How would you go about solving this problem more efficiently?

r2evans · Accepted Answer

Here's a demonstration of using dbplyr to affect non-equal joins. I'll use a temporary (in-memory) SQLite database, which has one side-effect of converting dates to numbers; this is both reversible (if you use SQLite) and should not be a factor for DBMSes that distinguish between Date and numeric.

Setup:

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# rename them in the db for unambiguity here in code
DBI::dbExecute(con, "drop table if exists table1")
DBI::dbExecute(con, "drop table if exists table2")
library(dplyr)
library(dbplyr)
copy_to(con, df1, "table1")
copy_to(con, df2, "table2")

The work:

tbl1 <- tbl(con, "table1")
tbl2 <- tbl(con, "table2")
tbl1 %>%
  group_by(id) %>%
  transmute(id, date1 = date, date2 = lead(date - 1, default = 99999999), thing1) %>%
  left_join(tbl2, ., sql_on = "LHS.id = RHS.id and (RHS.date1 <= LHS.date and LHS.date < RHS.date2)") %>%
  select(-date1, -date2, -id.y) %>%
  rename(id = id.x) %>%
  collect()
# # A tibble: 10 x 4
#       id  date thing2 thing1
#         
#  1     1 18266 v      a     
#  2     1 18312 w      a     
#  3     1 18326 x      b     
#  4     1 18331 y      b     
#  5     1 18387 z      c     
#  6     2 18297 v      a     
#  7     2 18302 w      a     
#  8     2 18357 x      b     
#  9     2 18362 y      b     
# 10     2 18418 z      c

Edit using a postgres container.

tbl1 %>%
  group_by(id) %>%
  transmute(id, date1 = date, date2 = lead(date), thing1) %>%
  mutate(date2 = if_else(is.na(date2), as.Date("2099-01-01"), date2)) %>%
  left_join(tbl2, ., sql_on = '"LHS".id = "RHS".id and ("RHS".date1 <= "LHS".date and "LHS".date < "RHS".date2)') %>%
  select(-date1, -date2, -id.y) %>%
  rename(id = id.x) %>%
  collect()
# # A tibble: 10 x 4
#       id date       thing2 thing1
#             
#  1     1 2020-01-05 v      a     
#  2     1 2020-02-20 w      a     
#  3     1 2020-03-05 x      b     
#  4     1 2020-03-10 y      b     
#  5     1 2020-05-05 z      c     
#  6     2 2020-02-05 v      a     
#  7     2 2020-02-10 w      a     
#  8     2 2020-04-05 x      b     
#  9     2 2020-04-10 y      b     
# 10     2 2020-06-05 z      c

Some of the problems I ran into with this:

date-math is something I should have expected problems with, since most DBMSes have specific functions for date/time arithmetic, and I made no effort to address that (I was being lazy with SQLite's numeric date-alias); additionally, I don't know how to get lead(..., default=) working, so I trimmed the lead(...) call and added a mutate;
postgres was taking issue with the fact that dbplyr identifier-quotes all of the tables name (e.g., FROM "table" AS "LHS", yet my sql_on was using unquoted LHS; when I changed the sql_on to have quoted table names, things stopped breaking; you can see how this query pans out by replacing collect() with show_query(), showing the change from
```
ON (LHS.id = RHS.id and (RHS.date1 <= LHS.date and LHS.date < RHS.date2))
```
to
```
ON ("LHS".id = "RHS".id and ("RHS".date1 <= "LHS".date and "LHS".date < "RHS".date2))
```

join data frame based on date ranges not exact values

Answers (2)

Related Questions