mapping based on nearest date

Question

I'd like to take a list of emails that have an assocaited "owner", and dates, and match them against incoming orders, with dates and values. The matching key is email, but I would like it to match on nearest date, as there are some cases where an email address may have 2 different owners.

reproducible example:

require(dplyr)

e <- c("e3r@gmail.com", "e3r@gmail.com", "poi@aol.com", "tyu@gmail.com", "poi@aol.com", "tyu@gmail.com")
d <- c("2017-01-01", "2017-04-05", "2017-04-12", "2017-05-05", "2017-08-05", "2017-12-05")
p <- c("Jeff", "Sam", "Sam", "Jeff", "Matt", "Matt")
df <- data_frame(e, d, p) %>% mutate(d=as.Date(d))

o <- c("e3r@gmail.com", "poi@aol.com", "sdf@gmail.com", "tyu@gmail.com")
d2 <- c("2017-02-02", "2017-04-28", "2017-05-05", "2017-07-01")
a <- c(600.50, 3000, 4999.99, 2050.6)
df2 <- data_frame(o, d2, a) %>% mutate(d2 = as.Date(d2))

the first df is the map of email address is owned by p person. The df2 is the orders that came in with those email addresses, and I would like to add a column to df2 that pins the correct "owner" which is whoever has that email address assigned to them and is closest in date.

desired result:

> df2
# A tibble: 4 x 3
              o         d2       a       newowner
                         
1 e3r@gmail.com 2017-02-02  600.50       "Jeff"
2   poi@aol.com 2017-04-28 3000.00       "Sam"
3 sdf@gmail.com 2017-05-05 4999.99       NA
4 tyu@gmail.com 2017-07-01 2050.60       "Jeff"

Currently, it is not a massive issue - out of 70k records of "owned" email addresses, there are only ~86 cases of duplication. However, ownership is changing for the last few months of the year and thus could introduce a lot of duplication if ownership moves around. I'm currently taking the owner from the furthest date, and overwriting the other owners in those 86 examples.

Thanks for your help!

www · Accepted Answer

@Frank's answer is great and like he said, data.table has rolling join but dplyr does not. So I think data.table is a better choice here. However, if you want to stay in dplyr. Here is one idea.

library(dplyr)
df3 <- df %>%
  full_join(df2, by = c("e" = "o")) %>%
  mutate(Date_ABS_Diff = abs(d - d2)) %>%
  arrange(e, Date_ABS_Diff) %>%
  group_by(e) %>%
  slice(1) %>%
  select(o = e, d2, a, newowner = p)
df3
# A tibble: 4 x 4
# Groups:   o [4]
              o         d2       a newowner
                      
1 e3r@gmail.com 2017-02-02  600.50     Jeff
2   poi@aol.com 2017-04-28 3000.00      Sam
3 sdf@gmail.com 2017-05-05 4999.99     
4 tyu@gmail.com 2017-07-01 2050.60     Jeff

This approach uses full_join to combine all possible combination by email. After that, calculate the absolute difference between dates and use arrange and slice to select the record with minimum absolute dates.

mapping based on nearest date

Answers (2)

Related Questions