Reputation: 195
I have a df:
df1 <- data.frame(date = c("2020-01-01", "2018-01-01"), A = c("5", NA), B = c("4", "0"))
date A B
2020-01-01 5 4
2018-01-01 <NA> 0
And a second df:
df2 <- data.frame(date = c("2020-05-16", "2018-09-23", "2017-02-02"), C = c("2", "3", "4"), D = c("9", "10", "11"))
date C D
2020-05-16 2 9
2018-09-23 3 10
2017-02-02 4 11
I want to join the two dfs such that only dates that are in df2 which are after df1's date and within 12 months are joined to df1. (while keeping df1's date).
The result of this join should hopefully look like the following:
df_result <- data.frame(date = c("2020-01-01", "2018-01-01"), A = c("5", NA), B = c("4", "0"), C = c("2", "3"), D = c("9", "10"))
date A B C D
2020-01-01 5 4 2 9
2018-01-01 <NA> 0 3 10
Joining if the dates are equal is easy using inner_join. However, I am not quite sure how to do an inner join with a condition that is not simply x = y.
Any help would be appreciated, thanks!
Upvotes: 1
Views: 1238
Reputation: 269461
Convert the dates to Date class and then use the indicated left join.
library(sqldf)
df1 <- data.frame(date = c("2020-01-01", "2018-01-01"),
A = c("5", NA), B = c("4", "0"))
df2 <- data.frame(date = c("2020-05-16", "2018-09-23", "2017-02-02"),
C = c("2", "3", "4"), D = c("9", "10", "11"))
df1$date <- as.Date(df1$date)
df2$date <- as.Date(df2$date)
sqldf("select a.*, b.C, b.D
from df1 a
left join df2 b on b.date > a.date and b.date - a.date <= 365")
## date A B C D
## 1 2020-01-01 5 4 2 9
## 2 2018-01-01 <NA> 0 3 10
In the sample data there is one match per row of df1 but if there could be several matches and we want only the least then
sqldf("select a.*, b.C, b.D, min(b.date - a.date) date_diff
from df1 a
left join df2 b on b.date > a.date and b.date - a.date <= 365
group by a.rowid
order by a.rowid")[-6]
Regarding the comment below this left joins df1
and df2
on the same conditions but for each df2
row in the join keeps only the df1
row that is nearest to it giving mm
. Then it left joins df1
to mm
to ensure that all rows of df1
are represented.
library(sqldf)
# added third row to df1 as per comment
df1 <- data.frame(date = c("2020-01-01", "2018-01-01", "2020-02-02"),
A = c(5, NA, 1), B = c(4, 0, 1))
df2 <- data.frame(date = c("2020-05-16", "2018-09-23", "2017-02-02"),
C = c(2, 3, 4), D = c(9, 10, 11))
df1$date <- as.Date(df1$date)
df2$date <- as.Date(df2$date)
mm <- sqldf("select a.*, b.C, b.D, min(b.date - a.date) date_diff
from df1 a
left join df2 b on b.date > a.date and b.date - a.date <= 365
group by b.rowid")
sqldf("select a.*, b.C, b.D
from df1 a
left join mm b using(date)")
date A B C D
## 1 2020-01-01 5 4 NA NA
## 2 2018-01-01 NA 0 3 10
## 3 2020-02-02 1 1 2 9
Upvotes: 3
Reputation: 4344
a dplyr solution though G. Grothendiecks answer it top notch!!
library(dplyr)
df1 %>%
# generate all possible combinations (possibly RAM expensive)
dplyr::full_join(df2, by = character()) %>%
# convert specific sample data columns to date
dplyr::mutate(across(contains("date"), as.Date)) %>%
# subset the data (you could use other functions here but this is exact enough as we are lookin on months)
dplyr::filter(date.y > date.x & date.y < (date.x + 365.25)) %>%
# rename and select columns
dplyr::select(date = 1, 2, 3, 5, 6)
date A B C D
1 2020-01-01 5 4 2 9
2 2018-01-01 <NA> 0 3 10
about you question: it is possible in this dplyr pipe to solve the aditional task you requestes though it is not the best solution.
# modified df2 so it does not match any of df1 in your criteria
df1 <- data.frame(date = c("2020-01-01", "2018-01-01"), A = c("5", NA), B = c("4", "0"))
df2 <- data.frame(date = c("2017-02-02"), C = c("2"), D = c("11"))
library(dplyr)
df1 %>%
# generate all possible combinations (possibly RAM expensive)
dplyr::full_join(df2, by = character()) %>%
# convert specific sample data columns to date
dplyr::mutate(across(contains("date"), as.Date)) %>%
# subset the data (you could use other functions here but this is exact enough as we are lookin on months)
dplyr::filter(date.y > date.x & date.y < (date.x + 365.25)) %>%
# rename and select columns
dplyr::select(date = 1, 2, 3, 5, 6) %>%
# bind the missing data back by selecting from within the pipe
dplyr::union(df1[!df1$date %in% unique(.$date),] %>%
# we have to simulate the missing columns and correct the date as dplyr::union() need the columns to match by name and type
dplyr::mutate(date = as.Date(date),
# you have to get the right NA_xxxx_ for you columns (critical!)
C = NA_character_,
D = NA_character_))
date A B C D
1 2020-01-01 5 4 <NA> <NA>
2 2018-01-01 <NA> 0 <NA> <NA>
Upvotes: 0