hbabbar
hbabbar

Reputation: 967

Join on substring of a key in R

I'm trying to join two tables with some codes in a way where in one column, the keys might be a subset of the original key.

Event
id  date  ProductId  quantity
a   xyz   1234567    30
a   abc   5826811    20
b   def   3619100    10
b   ghi   9268420    50

ProductDimension
code     name  type
234-567  p1    c1
826-81   p2    c2
61-9100  p3    c3  


Result should be:
eventAU
id date ProductId quantity name  type
a   xyz   1234567    30    p1     c1
a   abc   5826811    20    p2     c2
b   def   3619100    10    p3     c3

Taking a cue from this question, I'm trying to do a fuzzy join using the following:

ProductDimension$regex <- gsub("-", "", ProductDimension$code)

eventTbl <- tbl_df(Events)
prodcutTbl <- tbl_df(ProductDimension)

eventsAU <- regex_left_join(eventTbl , prodcutTbl , by = c(ProductId = "regex"))

but I'm getting the following exception :

Error: All columns in a tibble must be 1d or 2d objects: * Column `col` is NULL

Upvotes: 1

Views: 213

Answers (1)

tmfmnk
tmfmnk

Reputation: 40171

One dplyr and fuzzyjoin option could be:

stringdist_inner_join(df1, 
                      df2 %>%
                       mutate(code = sub("-", "", code)),
                      method = "lv",
                      by = c("ProductId" = "code"))

  id    date  ProductId quantity code   name  type 
  <chr> <chr>     <int>    <int> <chr>  <chr> <chr>
1 a     xyz     1234567       30 234567 p1    c1   
2 a     abc     5826811       20 82681  p2    c2   
3 b     def     3619100       10 619100 p3    c3   

Or if you specify the maximum distance, you can skip the sub() part and use the use of dplyr:

stringdist_inner_join(df1, 
                      df2,
                      method = "lv",
                      max_dist = 3,
                      by = c("ProductId" = "code"))

Upvotes: 1

Related Questions