Reputation: 967
I'm trying to join two tables with some codes in a way where in one column, the keys might be a subset of the original key.
Event
id date ProductId quantity
a xyz 1234567 30
a abc 5826811 20
b def 3619100 10
b ghi 9268420 50
ProductDimension
code name type
234-567 p1 c1
826-81 p2 c2
61-9100 p3 c3
Result should be:
eventAU
id date ProductId quantity name type
a xyz 1234567 30 p1 c1
a abc 5826811 20 p2 c2
b def 3619100 10 p3 c3
Taking a cue from this question, I'm trying to do a fuzzy join using the following:
ProductDimension$regex <- gsub("-", "", ProductDimension$code)
eventTbl <- tbl_df(Events)
prodcutTbl <- tbl_df(ProductDimension)
eventsAU <- regex_left_join(eventTbl , prodcutTbl , by = c(ProductId = "regex"))
but I'm getting the following exception :
Error: All columns in a tibble must be 1d or 2d objects: * Column `col` is NULL
Upvotes: 1
Views: 213
Reputation: 40171
One dplyr
and fuzzyjoin
option could be:
stringdist_inner_join(df1,
df2 %>%
mutate(code = sub("-", "", code)),
method = "lv",
by = c("ProductId" = "code"))
id date ProductId quantity code name type
<chr> <chr> <int> <int> <chr> <chr> <chr>
1 a xyz 1234567 30 234567 p1 c1
2 a abc 5826811 20 82681 p2 c2
3 b def 3619100 10 619100 p3 c3
Or if you specify the maximum distance, you can skip the sub()
part and use the use of dplyr
:
stringdist_inner_join(df1,
df2,
method = "lv",
max_dist = 3,
by = c("ProductId" = "code"))
Upvotes: 1