Reputation: 3765
I'm using fuzzyjoin to match politicians with their respective regions:
library(dplyr)
library(fuzzyjoin)
x <- tibble(name = c("Fulvio Rossi Ciocca", "Rigoberto Del Carmen Rojas Sarapura", "Lorena Vergara Bravo", "Lily Perez San Martin"),
            activity = c("surgeon", "business", "public administration", "publicist"))
y <- tibble(name = c("Rossi Ciocca Fulvio", "Perez San Martin Lily"), region = c(1, 5))
z <- x %>%
  stringdist_inner_join(y, max_dist = 10)
In my example, "Fulvio Rossi Ciocca" and "Rossi Ciocca Fulvio" are the same person. In fact, both of my datasets contain the same people, but the names vary in word order, like "Lennon John" instead of "John Lennon".
I looked through the fuzzyjoin documentation but couldn't find a way to write a working version of this pseudo-code:
x %>%
fuzzy_join(y, mode = "left", match_fun = "A ~ permutations(A)")
Upvotes: 2
Views: 504
Reputation: 5704
You can construct a unique "normalized" version of each name by sorting its parts alphabetically.
Then two names can be considered identical when they share the same normalized form.
Hence a possible solution is:
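# Split each name into words and sort them, so word order no longer matters
normalize <- function(v) lapply(strsplit(v, " "), sort)
# Two names match when their sorted word vectors are identical
mf <- function(a, b) mapply(identical, normalize(a), normalize(b))
fuzzy_left_join(x, y, by = "name", match_fun = mf)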
# # A tibble: 4 x 4
#   name.x                              activity              name.y                region
#   <chr>                               <chr>                 <chr>                  <dbl>
# 1 Fulvio Rossi Ciocca                 surgeon               Rossi Ciocca Fulvio        1
# 2 Rigoberto Del Carmen Rojas Sarapura business              <NA>                      NA
# 3 Lorena Vergara Bravo                public administration <NA>                      NA
# 4 Lily Perez San Martin               publicist             Perez San Martin Lily      5
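If the names only differ in word order, the same normalization idea can also be sketched without fuzzyjoin, by turning each name into a single string key and looking it up directly. This is only a rough base R sketch: `name_key` is a hypothetical helper (not part of fuzzyjoin or dplyr), and it assumes the words are separated by single spaces and spelled identically in both tables.

```r
# Hypothetical helper: sort the words of each name and paste them back
# together, so every permutation of the same words gets the same key.
name_key <- function(v) {
  vapply(strsplit(v, " ", fixed = TRUE),
         function(parts) paste(sort(parts), collapse = " "),
         character(1))
}

x <- data.frame(
  name = c("Fulvio Rossi Ciocca", "Rigoberto Del Carmen Rojas Sarapura",
           "Lorena Vergara Bravo", "Lily Perez San Martin"),
  activity = c("surgeon", "business", "public administration", "publicist"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  name = c("Rossi Ciocca Fulvio", "Perez San Martin Lily"),
  region = c(1, 5),
  stringsAsFactors = FALSE
)

# Position in y of the row whose key equals each key of x (NA when absent)
idx <- match(name_key(x$name), name_key(y$name))

# Left-join style result: every row of x, with NA where y has no match
z <- cbind(x, name.y = y$name[idx], region = y$region[idx])
```

Note that `match()` only returns the first matching row, so this sketch assumes each person appears at most once in `y`; with duplicates, the `fuzzy_left_join` version above is the safer choice.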
Upvotes: 4