Reputation: 3722
I'd like to take a list of emails that have an assocaited "owner", and dates, and match them against incoming orders, with dates and values. The matching key is email, but I would like it to match on nearest date, as there are some cases where an email address may have 2 different owners.
reproducible example:
require(dplyr)
e <- c("[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]")
d <- c("2017-01-01", "2017-04-05", "2017-04-12", "2017-05-05", "2017-08-05", "2017-12-05")
p <- c("Jeff", "Sam", "Sam", "Jeff", "Matt", "Matt")
df <- data_frame(e, d, p) %>% mutate(d=as.Date(d))
o <- c("[email protected]", "[email protected]", "[email protected]", "[email protected]")
d2 <- c("2017-02-02", "2017-04-28", "2017-05-05", "2017-07-01")
a <- c(600.50, 3000, 4999.99, 2050.6)
df2 <- data_frame(o, d2, a) %>% mutate(d2 = as.Date(d2))
the first df is the map of email address is owned by p
person. The df2
is the orders that came in with those email addresses, and I would like to add a column to df2
that pins the correct "owner" which is whoever has that email address assigned to them and is closest in date.
desired result:
> df2
# A tibble: 4 x 3
o d2 a newowner
<chr> <date> <dbl> <char>
1 [email protected] 2017-02-02 600.50 "Jeff"
2 [email protected] 2017-04-28 3000.00 "Sam"
3 [email protected] 2017-05-05 4999.99 NA
4 [email protected] 2017-07-01 2050.60 "Jeff"
Currently, it is not a massive issue - out of 70k records of "owned" email addresses, there are only ~86 cases of duplication. However, ownership is changing for the last few months of the year and thus could introduce a lot of duplication if ownership moves around. I'm currently taking the owner from the furthest date, and overwriting the other owners in those 86 examples.
Thanks for your help!
Upvotes: 3
Views: 2027
Reputation: 66819
This is a standard rolling update join in data.table:
library(data.table)
DT = data.table(df)
DT2 = data.table(df2)
DT2[DT, on=.(o = e, d2 = d), roll=-Inf, v := i.p ]
o d2 a v
1: [email protected] 2017-02-02 600.50 Jeff
2: [email protected] 2017-04-28 3000.00 Sam
3: [email protected] 2017-05-05 4999.99 NA
4: [email protected] 2017-07-01 2050.60 Jeff
If, per the title, you want the nearest match, use roll = "nearest"
instead of roll = -Inf
.
Dplyr doesn't have rolling joins and will never have update joins, so there's no analogue there.
Upvotes: 7
Reputation: 39154
@Frank's answer is great and like he said, data.table
has rolling join but dplyr
does not. So I think data.table
is a better choice here. However, if you want to stay in dplyr
. Here is one idea.
library(dplyr)
df3 <- df %>%
full_join(df2, by = c("e" = "o")) %>%
mutate(Date_ABS_Diff = abs(d - d2)) %>%
arrange(e, Date_ABS_Diff) %>%
group_by(e) %>%
slice(1) %>%
select(o = e, d2, a, newowner = p)
df3
# A tibble: 4 x 4
# Groups: o [4]
o d2 a newowner
<chr> <date> <dbl> <chr>
1 [email protected] 2017-02-02 600.50 Jeff
2 [email protected] 2017-04-28 3000.00 Sam
3 [email protected] 2017-05-05 4999.99 <NA>
4 [email protected] 2017-07-01 2050.60 Jeff
This approach uses full_join
to combine all possible combination by email. After that, calculate the absolute difference between dates and use arrange
and slice
to select the record with minimum absolute dates.
Upvotes: 3