Reputation: 4481
library(tidyverse)
library(fuzzyjoin)
df1 <- tibble(col1 = c("Apple Shipping", "Banana Shipping", "FedEX USA Ground",
"FedEx USA Commercial", "FedEx International"),
col2 = 1:5)
#> # A tibble: 5 x 2
#> col1 col2
#> <chr> <int>
#> 1 Apple Shipping 1
#> 2 Banana Shipping 2
#> 3 FedEX USA Ground 3
#> 4 FedEx USA Commercial 4
#> 5 FedEx International 5
df2 <- tibble(col3 = c("Banana", "FedEX USA"), col4 = c(700, 900))
#> # A tibble: 2 x 2
#> col3 col4
#> <chr> <dbl>
#> 1 Banana 700
#> 2 FedEX USA 900
The two data frames I'm working with are shown above. I'd like to fuzzy join them on col1
and col3
to come up with something similar to what's directly shown below. Basically the rule would be, "If all the text in col3
is in any of col1
count that as a match".
#> # A tibble: 3 x 4
#> col1 col2 col3 col4
#> <chr> <int> <chr> <int>
#> 1 Banana Shipping 2 Banana 700
#> 2 FedEX USA Ground 3 FedEx USA 900
#> 3 FedEx USA Commercial 4 FedEx USA 900
This older SO question appears to offer the solution, but it doesn't quite seem to work in this case and I get the errors shown below:
df1 %>% regex_inner_join(df2, by = c(string = "col3"))
#> Error: All columns in a tibble must be 1d or 2d objects:
#> * Column `col` is NULL
#> Run `rlang::last_error()` to see where the error occurred.
library(stringr)
df1 %>% fuzzy_inner_join(df2, by = c("string" = "col3"), match_fun = str_detect)
#> Error: All columns in a tibble must be 1d or 2d objects:
#> * Column `col` is NULL
#> Run `rlang::last_error()` to see where the error occurred.
How do I perform this fuzzy join with R?
Upvotes: 2
Views: 2080
Reputation: 24770
Perhaps this is what you're looking for?
library(dplyr)
library(fuzzyjoin)
library(stringr)
df1 %>% fuzzy_inner_join(df2,by=c("col1" = "col3"),match_fun = str_detect)
## A tibble: 2 x 4
# col1 col2 col3 col4
# <chr> <int> <chr> <dbl>
#1 Banana Shipping 2 Banana 700
#2 FedEX USA Ground 3 FedEX USA 900
If you wanted to ignore case, you could define your own str_detect
.
my_str_detect <- function(x,y){str_detect(x,regex(y, ignore_case = TRUE))}
df1 %>% fuzzy_inner_join(df2,by=c("col1" = "col3"),match_fun = my_str_detect)
## A tibble: 3 x 4
# col1 col2 col3 col4
# <chr> <int> <chr> <dbl>
#1 Banana Shipping 2 Banana 700
#2 FedEX USA Ground 3 FedEX USA 900
#3 FedEx USA Commercial 4 FedEX USA 900
For bonus points you can use agrepl
from this question.
You can modify the max.distance =
argument and potentially add cost =
. See help(agrepl)
for more.
my_match_fun <- Vectorize(function(x,y) agrepl(x, y, ignore.case=TRUE, max.distance = 0.7, useBytes = TRUE))
df1 %>% fuzzy_inner_join(df2,by=c("col1" = "col3"),match_fun = my_match_fun)
## A tibble: 4 x 4
# col1 col2 col3 col4
# <chr> <int> <chr> <dbl>
#1 Banana Shipping 2 Banana 700
#2 FedEX USA Ground 3 FedEX USA 900
#3 FedEx USA Commercial 4 FedEX USA 900
#4 FedEx International 5 FedEX USA 900
Upvotes: 5