Matt W.
Matt W.

Reputation: 3722

mapping based on nearest date

I'd like to take a list of emails that have an assocaited "owner", and dates, and match them against incoming orders, with dates and values. The matching key is email, but I would like it to match on nearest date, as there are some cases where an email address may have 2 different owners.

reproducible example:

require(dplyr)

e <- c("[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]")
d <- c("2017-01-01", "2017-04-05", "2017-04-12", "2017-05-05", "2017-08-05", "2017-12-05")
p <- c("Jeff", "Sam", "Sam", "Jeff", "Matt", "Matt")
df <- data_frame(e, d, p) %>% mutate(d=as.Date(d))

o <- c("[email protected]", "[email protected]", "[email protected]", "[email protected]")
d2 <- c("2017-02-02", "2017-04-28", "2017-05-05", "2017-07-01")
a <- c(600.50, 3000, 4999.99, 2050.6)
df2 <- data_frame(o, d2, a) %>% mutate(d2 = as.Date(d2))

the first df is the map of email address is owned by p person. The df2 is the orders that came in with those email addresses, and I would like to add a column to df2 that pins the correct "owner" which is whoever has that email address assigned to them and is closest in date.

desired result:

> df2
# A tibble: 4 x 3
              o         d2       a       newowner
          <chr>     <date>   <dbl>       <char>
1 [email protected] 2017-02-02  600.50       "Jeff"
2   [email protected] 2017-04-28 3000.00       "Sam"
3 [email protected] 2017-05-05 4999.99       NA
4 [email protected] 2017-07-01 2050.60       "Jeff"

Currently, it is not a massive issue - out of 70k records of "owned" email addresses, there are only ~86 cases of duplication. However, ownership is changing for the last few months of the year and thus could introduce a lot of duplication if ownership moves around. I'm currently taking the owner from the furthest date, and overwriting the other owners in those 86 examples.

Thanks for your help!

Upvotes: 3

Views: 2027

Answers (2)

Frank
Frank

Reputation: 66819

This is a standard rolling update join in data.table:

library(data.table)
DT = data.table(df)
DT2 = data.table(df2)
DT2[DT, on=.(o = e, d2 = d), roll=-Inf, v := i.p ]

               o         d2       a    v
1: [email protected] 2017-02-02  600.50 Jeff
2:   [email protected] 2017-04-28 3000.00  Sam
3: [email protected] 2017-05-05 4999.99   NA
4: [email protected] 2017-07-01 2050.60 Jeff

If, per the title, you want the nearest match, use roll = "nearest" instead of roll = -Inf.

Dplyr doesn't have rolling joins and will never have update joins, so there's no analogue there.

Upvotes: 7

www
www

Reputation: 39154

@Frank's answer is great and like he said, data.table has rolling join but dplyr does not. So I think data.table is a better choice here. However, if you want to stay in dplyr. Here is one idea.

library(dplyr)
df3 <- df %>%
  full_join(df2, by = c("e" = "o")) %>%
  mutate(Date_ABS_Diff = abs(d - d2)) %>%
  arrange(e, Date_ABS_Diff) %>%
  group_by(e) %>%
  slice(1) %>%
  select(o = e, d2, a, newowner = p)
df3
# A tibble: 4 x 4
# Groups:   o [4]
              o         d2       a newowner
          <chr>     <date>   <dbl>    <chr>
1 [email protected] 2017-02-02  600.50     Jeff
2   [email protected] 2017-04-28 3000.00      Sam
3 [email protected] 2017-05-05 4999.99     <NA>
4 [email protected] 2017-07-01 2050.60     Jeff

This approach uses full_join to combine all possible combination by email. After that, calculate the absolute difference between dates and use arrange and slice to select the record with minimum absolute dates.

Upvotes: 3

Related Questions