Reputation: 7725
I have two dataframes, df1
and df2
. I want to join these dataframes on id
and date
so that each observation in df2
is mapped to df1
based on date range. There are no exact matches on id
and date
, so I have to infer the correct match based on where the date
in df2
falls relative to the date
in df1
for each id
. Let's get generate the data for a toy example.
library(tidyverse)
df1 <- data.frame(id = c(1, 1, 1, 2, 2, 2),
date = lubridate::date(c("2020-01-01",
"2020-03-01",
"2020-05-01",
"2020-02-01",
"2020-04-01",
"2020-06-01")),
thing1 = c("a", "b", "c",
"a", "b", "c"))
df2 <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
date = lubridate::date(c("2020-01-05",
"2020-02-20",
"2020-03-05",
"2020-03-10",
"2020-05-05",
"2020-02-05",
"2020-02-10",
"2020-04-05",
"2020-04-10",
"2020-06-05")),
thing2 = c("v", "w", "x", "y", "z",
"v", "w", "x", "y", "z"))
df1
# id date thing1
#1 1 2020-01-01 a
#2 1 2020-03-01 b
#3 1 2020-05-01 c
#4 2 2020-02-01 a
#5 2 2020-04-01 b
#6 2 2020-06-01 c
df2
# id date thing2
#1 1 2020-01-05 v
#2 1 2020-02-20 w
#3 1 2020-03-05 x
#4 1 2020-03-10 y
#5 1 2020-05-05 z
#6 2 2020-02-05 v
#7 2 2020-02-10 w
#8 2 2020-04-05 x
#9 2 2020-04-10 y
#10 2 2020-06-05 z
Here's what I want:
want <- structure(list(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), date = structure(c(18266,
18312, 18326, 18331, 18387, 18297, 18302, 18357, 18362, 18418
), class = "Date"), thing2 = structure(c(1L, 2L, 3L, 4L, 5L,
1L, 2L, 3L, 4L, 5L), .Label = c("v", "w", "x", "y", "z"), class = "factor"),
thing1 = structure(c(1L, 1L, 2L, 2L, 3L, 1L, 1L, 2L, 2L,
3L), .Label = c("a", "b", "c"), class = "factor")), class = "data.frame", row.names = c(NA,
-10L))
want
# id date thing2 thing1
#1 1 2020-01-05 v a
#2 1 2020-02-20 w a
#3 1 2020-03-05 x b
#4 1 2020-03-10 y b
#5 1 2020-05-05 z c
#6 2 2020-02-05 v a
#7 2 2020-02-10 w a
#8 2 2020-04-05 x b
#9 2 2020-04-10 y b
#10 2 2020-06-05 z c
In this example, id==1
on 2020-02-20
in df2
maps to thing1==a
because "a" is the reigning value for id==1
between the first value "2020-01-01" and the second value "2020-03-01".
Problem 1: I can get almost there by using complete()
on df1
to fill in all of the missing dates by id
and then join with df2
. One problem is that there is no exact match for dates that fall out of the min/max range for each id
. For instance, id==1
on "2020-05-05" should map to a value of "c" because it falls after "2020-05-01" in df1
, but there's no match in the completed/joined dataframe.
df1_ <- df1 %>%
group_by(id) %>%
complete(id, date = seq.Date(min(date), max(date), by="day")) %>%
fill(thing1)
df1_df2 <- df2 %>%
left_join(df1_, by=c("id", "date"))
df1_df2
# id date thing2 thing1
#1 1 2020-01-05 v a
#2 1 2020-02-20 w a
#3 1 2020-03-05 x b
#4 1 2020-03-10 y b
#5 1 2020-05-05 z <NA>
#6 2 2020-02-05 v a
#7 2 2020-02-10 w a
#8 2 2020-04-05 x b
#9 2 2020-04-10 y b
#10 2 2020-06-05 z <NA>
Problem 2: The bigger problem for me is that this approach does not scale. My real df1
has several date observations each for >500k ids. Creating a complete dataframe by person between big data ranges creates millions of records quickly and takes too long.
How would you go about solving this problem more efficiently?
Upvotes: 0
Views: 1234
Reputation: 7725
A tip and a few links from @r2evans helped me get to a solution in {dplyr}
. While this works for my toy example, I run out of vector memory for my real use case. I need to look into the {dbplyr}
option. Tips appreciated.
First, get an end date column to make date ranges:
df1_ <-
df1 %>%
group_by(id) %>%
mutate(dateEnd = lead(date, 1),
dateEnd = if_else(is.na(dateEnd),
lubridate::date(Sys.Date()),
dateEnd))
Then use the {fuzzyjoin}
package to join by date range.
library(fuzzyjoin)
fj <- df2 %>%
fuzzy_left_join(df1_,
by = c("id" = "id",
"date" = "date",
"date" = "dateEnd"),
match_fun = list(`==`, `>=`, `<`)
) %>%
rename(id = id.x, date = date.x) %>%
select(id, date, thing1, thing2)
Upvotes: 1
Reputation: 160407
Here's a demonstration of using dbplyr
to affect non-equal joins. I'll use a temporary (in-memory) SQLite database, which has one side-effect of converting dates to numbers; this is both reversible (if you use SQLite) and should not be a factor for DBMSes that distinguish between Date
and numeric
.
Setup:
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# rename them in the db for unambiguity here in code
DBI::dbExecute(con, "drop table if exists table1")
DBI::dbExecute(con, "drop table if exists table2")
library(dplyr)
library(dbplyr)
copy_to(con, df1, "table1")
copy_to(con, df2, "table2")
The work:
tbl1 <- tbl(con, "table1")
tbl2 <- tbl(con, "table2")
tbl1 %>%
group_by(id) %>%
transmute(id, date1 = date, date2 = lead(date - 1, default = 99999999), thing1) %>%
left_join(tbl2, ., sql_on = "LHS.id = RHS.id and (RHS.date1 <= LHS.date and LHS.date < RHS.date2)") %>%
select(-date1, -date2, -id.y) %>%
rename(id = id.x) %>%
collect()
# # A tibble: 10 x 4
# id date thing2 thing1
# <dbl> <dbl> <chr> <chr>
# 1 1 18266 v a
# 2 1 18312 w a
# 3 1 18326 x b
# 4 1 18331 y b
# 5 1 18387 z c
# 6 2 18297 v a
# 7 2 18302 w a
# 8 2 18357 x b
# 9 2 18362 y b
# 10 2 18418 z c
Edit using a postgres container.
tbl1 %>%
group_by(id) %>%
transmute(id, date1 = date, date2 = lead(date), thing1) %>%
mutate(date2 = if_else(is.na(date2), as.Date("2099-01-01"), date2)) %>%
left_join(tbl2, ., sql_on = '"LHS".id = "RHS".id and ("RHS".date1 <= "LHS".date and "LHS".date < "RHS".date2)') %>%
select(-date1, -date2, -id.y) %>%
rename(id = id.x) %>%
collect()
# # A tibble: 10 x 4
# id date thing2 thing1
# <dbl> <date> <chr> <chr>
# 1 1 2020-01-05 v a
# 2 1 2020-02-20 w a
# 3 1 2020-03-05 x b
# 4 1 2020-03-10 y b
# 5 1 2020-05-05 z c
# 6 2 2020-02-05 v a
# 7 2 2020-02-10 w a
# 8 2 2020-04-05 x b
# 9 2 2020-04-10 y b
# 10 2 2020-06-05 z c
Some of the problems I ran into with this:
date-math is something I should have expected problems with, since most DBMSes have specific functions for date/time arithmetic, and I made no effort to address that (I was being lazy with SQLite's numeric date-alias); additionally, I don't know how to get lead(..., default=)
working, so I trimmed the lead(...)
call and added a mutate
;
postgres was taking issue with the fact that dbplyr
identifier-quotes all of the tables name (e.g., FROM "table" AS "LHS"
, yet my sql_on
was using unquoted LHS
; when I changed the sql_on
to have quoted table names, things stopped breaking; you can see how this query pans out by replacing collect()
with show_query()
, showing the change from
ON (LHS.id = RHS.id and (RHS.date1 <= LHS.date and LHS.date < RHS.date2))
to
ON ("LHS".id = "RHS".id and ("RHS".date1 <= "LHS".date and "LHS".date < "RHS".date2))
Upvotes: 2