pachadotdev

fuzzy join with permutations in strings

I'm using fuzzyjoin to match politicians with their respective regions:

library(dplyr)
library(fuzzyjoin)

x <- tibble(name = c("Fulvio Rossi Ciocca", "Rigoberto Del Carmen Rojas Sarapura", "Lorena Vergara Bravo", "Lily Perez San Martin"),
            activity = c("surgeon", "business", "public administration", "publicist"))

y <- tibble(name = c("Rossi Ciocca Fulvio", "Perez San Martin Lily"), region = c(1,5))

z <- x %>%
  stringdist_inner_join(y, max_dist = 10)

In my example, "Fulvio Rossi Ciocca" and "Rossi Ciocca Fulvio" are the same person. In fact, all of my datasets contain the same people, but with variations in word order such as "Lennon John" instead of "John Lennon".

I looked through the fuzzyjoin documentation, but I couldn't find a way to write a working version of this pseudo-code:

x %>%
  fuzzy_join(y, mode = "left", match_fun = "A ~ permutations(A)")


Answers (1)

Scarabee

You can construct a unique "normalized" version of each name by sorting its parts alphabetically.

Then two names can be considered identical when they share the same normalized form.
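
To make the idea concrete, here is a minimal base-R sketch of the normalization step. The helper name_key is hypothetical (it is not part of fuzzyjoin), and it assumes name parts are separated by single spaces:

```r
# Hypothetical helper: split each name on spaces, sort the parts,
# and glue them back together into one comparable string.
name_key <- function(v) {
  vapply(
    strsplit(v, " ", fixed = TRUE),
    function(parts) paste(sort(parts), collapse = " "),
    character(1)
  )
}

a <- name_key("Fulvio Rossi Ciocca")
b <- name_key("Rossi Ciocca Fulvio")

# Both orderings collapse to the same key.
identical(a, b)
# [1] TRUE
```

Differences in case, accents or extra whitespace would still produce different keys, so real data may need further cleaning before this step.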

Hence a possible solution is:

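# Split each name into its parts and sort them alphabetically
normalize <- function(v) lapply(strsplit(v, " "), sort)

# Two names match when their sorted parts are identical
mf <- function(a, b) mapply(identical, normalize(a), normalize(b))

# x and y are the tibbles from the question; fuzzyjoin must be loaded
fuzzy_left_join(x, y, by = "name", match_fun = mf)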
# # A tibble: 4 x 4
#                                name.x              activity                name.y region
#                                 <chr>                 <chr>                 <chr>  <dbl>
# 1                 Fulvio Rossi Ciocca               surgeon   Rossi Ciocca Fulvio      1
# 2 Rigoberto Del Carmen Rojas Sarapura              business                  <NA>     NA
# 3                Lorena Vergara Bravo public administration                  <NA>     NA
# 4               Lily Perez San Martin             publicist Perez San Martin Lily      5
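
Because the match on the normalized form is exact, the same result can probably also be reached without a fuzzy join, by looking up a precomputed key. The following is only a base-R sketch under that assumption: name_key and the name_y column are hypothetical names, the data are copied from the question, and match() keeps only the first hit if y contains several names with the same key.

```r
x <- data.frame(
  name = c("Fulvio Rossi Ciocca", "Rigoberto Del Carmen Rojas Sarapura",
           "Lorena Vergara Bravo", "Lily Perez San Martin"),
  activity = c("surgeon", "business", "public administration", "publicist"),
  stringsAsFactors = FALSE
)

y <- data.frame(
  name = c("Rossi Ciocca Fulvio", "Perez San Martin Lily"),
  region = c(1, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one sorted-parts string per name.
name_key <- function(v) {
  vapply(
    strsplit(v, " ", fixed = TRUE),
    function(parts) paste(sort(parts), collapse = " "),
    character(1)
  )
}

# For each row of x, find the row of y with the same key (NA if none).
idx <- match(name_key(x$name), name_key(y$name))

# Left-join style result: unmatched rows of x get NA.
z <- x
z$name_y <- y$name[idx]
z$region <- y$region[idx]
z
```

This computes each key once per row instead of comparing names pairwise, which may matter on larger tables; the fuzzy_left_join version above remains the more direct answer to the question as asked.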
