Reputation: 1089
I have two large dataframes, dfA
and dfB
, for which I have generated simple examples here
dfA = data.frame(id=c("Apple", "Banana", "Carrot", "Dates", "Egg"),
Answer_Date=as.Date(c("2013-12-07", "2014-12-07", "2015-12-07", "2016-12-07", "2017-12-07" )),
x1 = c(1, 2, 3, 4, 5),
x2 = c(10, 20, 30, 40, 50))
Browse[2]> dfA
id Answer_Date x1 x2
1 Apple 2013-12-07 1 10
2 Banana 2014-12-07 2 20
3 Carrot 2015-12-07 3 30
4 Dates 2016-12-07 4 40
5 Egg 2017-12-07 5 50
dfB = data.frame(id=c("Apple", "Apple", "Banana", "Banana", "Banana"),
Answer_Date=as.Date(c("2013-12-05", "2014-12-07", "2015-12-10", "2018-11-07", "2019-11-07" )),
x3 = c(5, 4, 3, 2, 1),
x4 = c(50, 40, 30, 20, 10))
Browse[2]> dfB
id Answer_Date x3 x4
1 Apple 2013-12-05 5 50
2 Apple 2014-12-07 4 40
3 Banana 2014-12-10 3 30
4 Banana 2018-11-07 2 20
5 Banana 2019-11-07 1 10
I'd like to merge them by the closest date so that I get the items that exist in both dfA and dfB matched exactly by id and as closely as possible by Answer_Date (i.e minimum absolute value of date difference between the two dates). In this case I'd like to get
dfC
id Answer_Date.x Answer_Date.y x1 x2 x3 x4
1 Apple 2013-12-07 2013-12-05 1 10 5 50
2 Banana 2014-12-07 2014-12-10 2 20 3 30
Unfortunately struggling with merge() and trying out various solutions that I have found on StackOverflow hasn't solved my problem and has only got me confused. Would someone kindly point me to the right solution, ideally with a simple explanation as to why it works?
Sincerely and with many thanks in advance
Thomas Philips
Upvotes: 4
Views: 1136
Reputation: 328
Left join dfB
to dfA
, take the difference between dates per row and choose the smallest diff per id.
left_join(dfA, dfB, by = "id") %>%
mutate(date_diff = abs(Answer_Date.x - Answer_Date.y)) %>%
group_by(id) %>%
filter(date_diff == min(date_diff)) %>%
select(id, Answer_Date.x, Answer_Date.y, starts_with("x"), date_diff)
Then output is:
# A tibble: 2 x 8
# Groups: id [2]
id Answer_Date.x Answer_Date.y x1 x2 x3 x4 date_diff
<fct> <date> <date> <dbl> <dbl> <dbl> <dbl> <drtn>
1 Apple 2013-12-07 2013-12-05 1 10 5 50 2 days
2 Banana 2014-12-07 2014-12-10 2 20 3 30 3 days
By the way, in your sample code the third Answer_Date
in the definition of dfB
should be "2014-12-10"
instead of "2015-12-10"
.
Upvotes: 2