Display name
Display name

Reputation: 4481

Join data frames based fuzzy matching of strings

library(tidyverse)
library(fuzzyjoin)
df1 <- tibble(col1 = c("Apple Shipping", "Banana Shipping", "FedEX USA Ground",
                       "FedEx USA Commercial", "FedEx International"),
              col2 = 1:5)
#> # A tibble: 5 x 2
#>   col1                  col2
#>   <chr>                <int>
#> 1 Apple Shipping           1
#> 2 Banana Shipping          2
#> 3 FedEX USA Ground         3
#> 4 FedEx USA Commercial     4
#> 5 FedEx International      5

df2 <- tibble(col3 = c("Banana", "FedEX USA"), col4 = c(700, 900))
#> # A tibble: 2 x 2
#>   col3       col4
#>   <chr>     <dbl>
#> 1 Banana      700
#> 2 FedEX USA   900

The two data frames I'm working with are shown above. I'd like to fuzzy join them on col1 and col3 to come up with something similar to what's directly shown below. Basically the rule would be, "If all the text in col3 is in any of col1 count that as a match".

#> # A tibble: 3 x 4
#>   col1                  col2  col3      col4
#>   <chr>                <int>  <chr>    <int>
#> 1 Banana Shipping          2  Banana     700
#> 2 FedEX USA Ground         3  FedEx USA  900
#> 3 FedEx USA Commercial     4  FedEx USA  900

This older SO question appears to offer the solution, but it doesn't quite seem to work in this case and I get the errors shown below:

df1 %>% regex_inner_join(df2, by = c(string = "col3"))
#> Error: All columns in a tibble must be 1d or 2d objects:
#> * Column `col` is NULL
#> Run `rlang::last_error()` to see where the error occurred.

library(stringr)
df1 %>% fuzzy_inner_join(df2, by = c("string" = "col3"), match_fun = str_detect)
#> Error: All columns in a tibble must be 1d or 2d objects:
#> * Column `col` is NULL
#> Run `rlang::last_error()` to see where the error occurred.

How do I perform this fuzzy join with R?

Upvotes: 2

Views: 2080

Answers (1)

Ian Campbell
Ian Campbell

Reputation: 24770

Perhaps this is what you're looking for?

library(dplyr)
library(fuzzyjoin)
library(stringr)
df1 %>% fuzzy_inner_join(df2,by=c("col1" = "col3"),match_fun = str_detect)
## A tibble: 2 x 4
#  col1              col2 col3       col4
#  <chr>            <int> <chr>     <dbl>
#1 Banana Shipping      2 Banana      700
#2 FedEX USA Ground     3 FedEX USA   900

If you wanted to ignore case, you could define your own str_detect.

my_str_detect <- function(x,y){str_detect(x,regex(y, ignore_case = TRUE))}
df1 %>% fuzzy_inner_join(df2,by=c("col1" = "col3"),match_fun = my_str_detect)
## A tibble: 3 x 4
#  col1                  col2 col3       col4
#  <chr>                <int> <chr>     <dbl>
#1 Banana Shipping          2 Banana      700
#2 FedEX USA Ground         3 FedEX USA   900
#3 FedEx USA Commercial     4 FedEX USA   900

For bonus points you can use agrepl from this question.

You can modify the max.distance = argument and potentially add cost =. See help(agrepl) for more.

my_match_fun <- Vectorize(function(x,y) agrepl(x, y, ignore.case=TRUE, max.distance = 0.7, useBytes = TRUE))
df1 %>% fuzzy_inner_join(df2,by=c("col1" = "col3"),match_fun = my_match_fun)
## A tibble: 4 x 4
#  col1                  col2 col3       col4
#  <chr>                <int> <chr>     <dbl>
#1 Banana Shipping          2 Banana      700
#2 FedEX USA Ground         3 FedEX USA   900
#3 FedEx USA Commercial     4 FedEX USA   900
#4 FedEx International      5 FedEX USA   900

Upvotes: 5

Related Questions