Reputation: 539
I'm attempting to do a dplyr left join on two dataframes based on greatly similar language (that's not exact).
DF1:
title | records
Bob's show, part 1 | 42
Time for dinner | 77
Horsecrap | 121
DF2:
showname | counts
Bob's show part 1 | 772
Dinner time | 89
No way Jose | 123
I execute this to get the string distances as a vector using the stringdist package/library:
titlematch <- amatch(df1$title,df2$showname)
The vector looks like... well, an integer vector:
titlematch
1
2
NA
Normally if I had exact matches I'd do:
blended <- left_join(df1, df2, by = c("title" = "showname"))
How do I do the left join using the vector as the record selector so that the end outcome would be:
title | records | showname | counts
Bob's show, part 1 | 42 | Bob's show part 1 | 772
Time for dinner | 77 | Dinner time | 89
With the third non-match excluded because there's not a probable match in the vector (NA).
Upvotes: 2
Views: 720
Reputation: 539
camille suggested in a comment:
Have you looked at
fuzzyjoin
?
I had never heard of fuzzyjoin
before, but I tried it and loved it. stringdist_left_join
was exactly what I needed.
Upvotes: 1
Reputation: 5405
Here's a shot,
library(stringdist)
library(tidyverse)
df1 %>%
as_tibble() %>%
mutate(temp = amatch(title, df2$showname, maxDist = 10)) %>%
bind_cols(df2[.$temp, ]) %>%
select(-temp)
# A tibble: 3 x 4
title records showname counts
<chr> <int> <chr> <int>
1 Bob's show, part 1 42 Bob's show part 1 772
2 Time for dinner 77 Dinner time 89
3 Horsecrap 121 Dinner time 89
I'm unable to reproduce your numeric matching vector, amatch(df1$title, df2$showname)
gives me [1] NA NA NA
as it looks like the default is 0.1, so I set maxDist
to 10.
At the end, you could always add %>% filter(is.na(showname))
to remove any row which had no match.
df1 <- structure(list(title = c("Bob's show, part 1", "Time for dinner",
"Horsecrap"), records = c(42L, 77L, 121L)), .Names = c("title",
"records"), row.names = c(NA, -3L), class = "data.frame")
df2 <- structure(list(showname = c("Bob's show part 1", "Dinner time",
"No way Jose"), counts = c(772L, 89L, 123L)), .Names = c("showname",
"counts"), row.names = c(NA, -3L), class = "data.frame")
Upvotes: 2