Reputation: 220
I have a large file of administrative data, about 1 million records. Individual people can be represented multiple times in this dataset. About half the records have an identifying code that maps records to individuals; for the half that don't, I need to fuzzy match names to flag records that potentially belong to the same person.
From looking at the records with the identifying code, I've created a list of differences that have occurred in the recording of names for the same individual:
Given the types of matches I'm after, is there a better approach than agrep()/Levenshtein distance that is easily implemented in R?
Edit: agrep() in R doesn't do a very good job with this problem. Because of the large number of insertions and substitutions I need to allow to account for the ways names are recorded differently, it throws up a lot of false matches.
Upvotes: 2
Views: 788
Reputation: 3285
The synthesisr package (https://cran.r-project.org/web/packages/synthesisr/index.html) might be helpful. It uses R code to mimic some of the fuzzy matching functionality in the fuzzywuzzy Python package and fuzzywuzzyR. There are several similarity metrics taken from fuzzywuzzy; a lower score means a greater similarity. The methods are accessible in two different ways, as shown below.
Specifically, in this case, the "token" functions might be useful since strings are tokenized by whitespace then alphabetized to deal with situations like reversals.
library(synthesisr)
# Each metric can be called directly, or through fuzzdist() with a method name
fuzz_m_ratio("this is a test", "this is a test!")
fuzzdist("this is a test", "this is a test!", method = "fuzz_m_ratio")
fuzz_partial_ratio("this is a test", "this is a test!")
fuzzdist("this is a test", "this is a test!", method = "fuzz_partial_ratio")
# Token-based metrics: the ones most relevant to reversed names
fuzz_token_sort_ratio("this is a test", "this is a test!")
fuzzdist("this is a test", "this is a test!", method = "fuzz_token_sort_ratio")
fuzz_token_set_ratio("this is a test", "this is a test!")
fuzzdist("this is a test", "this is a test!", method = "fuzz_token_set_ratio")
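If you would rather not depend on a package, the token-sort idea can be sketched in base R. This is only an illustration of the technique, not how synthesisr implements it; token_sort() and token_sort_dist() are hypothetical helper names, and the scaling by the longer string's length is an assumption made for this example.

```r
# Normalise a name: lower-case it, turn punctuation into spaces,
# split on whitespace, sort the tokens and paste them back together.
token_sort <- function(x) {
  x <- tolower(gsub("[[:punct:]]", " ", x))
  vapply(strsplit(trimws(x), "\\s+"),
         function(tokens) paste(sort(tokens), collapse = " "),
         character(1))
}

# Levenshtein distance between the token-sorted forms,
# divided by the length of the longer string (0 = identical).
token_sort_dist <- function(a, b) {
  a <- token_sort(a)
  b <- token_sort(b)
  d <- mapply(function(x, y) adist(x, y)[1, 1], a, b)
  unname(d / pmax(nchar(a), nchar(b)))
}

token_sort("Snow, Jon")
token_sort_dist("Jon Snow", "Snow Jon")
token_sort_dist("Jon Snow", "Jon Snwo")
```

Because the tokens are sorted before comparison, a reversal such as "Snow Jon" costs nothing, while a genuine misspelling still gives a non-zero distance.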
Upvotes: 0
Reputation: 965
I would make multiple passes.
"Jon .* Snow"
- Middle name
"Jon .*Snow"
- Second last name
Nicknames will require a dictionary of mappings from long form to short; there's no regular expression that'll handle this.
"Snow Jon"
- Reversal (duh)
agrep will handle minor misspellings.
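A small sketch of that pass, using made-up example names. Note that agrep() does approximate substring matching, so adist() on the whole strings is shown alongside it as a stricter check; the threshold of one edit is an assumption you would tune on the records that do have an identifying code.

```r
# Hypothetical example data
names_vec <- c("Jon Snow", "John Snow", "Jon Snoww", "Daenerys Targaryen")

# Candidates within one edit of the pattern (approximate substring match)
hits <- agrep("Jon Snow", names_vec, max.distance = 1, value = TRUE)

# Full-string Levenshtein distance from the pattern to every name
d <- as.vector(adist("Jon Snow", names_vec))

# Keep only names whose whole string is within one edit
close_names <- names_vec[d <= 1]
```

Tightening max.distance and confirming candidates with a full-string distance is one way to cut down on false matches.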
You probably also want to tokenise your names into first-, middle- and last-.
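To make the passes concrete, here is a rough base R sketch. Everything in it is illustrative: the nickname dictionary is a tiny made-up sample, the helper names are hypothetical, and it maps short forms to a long form (the opposite direction to the wording above) so that every variant collapses to one canonical token.

```r
# Hypothetical nickname dictionary: short form -> long form
nicknames <- c(jon = "jonathan", bob = "robert", liz = "elizabeth")

# Split a name into lower-case tokens on whitespace, commas and hyphens
tokenise <- function(x) {
  tokens <- strsplit(tolower(trimws(x)), "[[:space:],-]+")[[1]]
  tokens[nzchar(tokens)]
}

# Replace any token found in the dictionary with its long form
expand_nicknames <- function(tokens, dict = nicknames) {
  hit <- tokens %in% names(dict)
  tokens[hit] <- unname(dict[tokens[hit]])
  tokens
}

# Pass 1: same set of tokens in any order (reversals, nicknames)
same_tokens <- function(a, b) {
  setequal(expand_nicknames(tokenise(a)), expand_nicknames(tokenise(b)))
}

# Pass 2: first and last tokens agree, anything in between is ignored
# (middle names, second last names)
same_first_last <- function(a, b) {
  ta <- expand_nicknames(tokenise(a))
  tb <- expand_nicknames(tokenise(b))
  ta[1] == tb[1] && ta[length(ta)] == tb[length(tb)]
}

same_tokens("Snow Jon", "Jonathan Snow")
same_first_last("Jon Targaryen Snow", "Jon Snow")
```

Records that fail these exact passes can then go to the approximate matching pass, which keeps the fuzzy step for genuine misspellings only.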
Upvotes: 1