I am looking to compare two data fields using fuzzy-match algorithms for record linkage in C#, and I want to determine which algorithm would be best for each comparison.
The fields I am looking to compare are:
The Approximate String Matching Algorithms (ASMs) I am utilizing currently are:
First, I compare two fields, such as FirstName1 and FirstName2, to see whether they are an exact match.
For example, FirstName1 = "Bob" and FirstName2 = "Bob" are an exact match, so the comparison does not move on to fuzzy matching.
On the other hand, FirstName1 = "Jill" and FirstName2 = "Bob" are not an exact match, so those two fields move on to a fuzzy comparison.
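The flow described above could be sketched roughly as follows. This is only an illustration: `FieldComparer`, `Compare`, and `LevenshteinSimilarity` are hypothetical names, and normalized Levenshtein distance is used purely as a placeholder metric, not as one of the algorithms actually in use here.

```csharp
using System;

public static class FieldComparer
{
    // Returns 1.0 for an exact match; otherwise defers to the supplied fuzzy metric.
    public static double Compare(string a, string b, Func<string, string, double> fuzzy)
    {
        if (string.Equals(a, b, StringComparison.Ordinal))
            return 1.0;           // exact match: fuzzy matching is skipped

        return fuzzy(a, b);       // no exact match: fall through to the fuzzy comparison
    }

    // Placeholder metric (an assumption): Levenshtein distance scaled to [0, 1],
    // where 1.0 means identical and 0.0 means nothing in common.
    public static double LevenshteinSimilarity(string a, string b)
    {
        if (a.Length == 0 && b.Length == 0) return 1.0;

        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            (prev, curr) = (curr, prev);   // the row just computed becomes the previous row
        }

        return 1.0 - (double)prev[b.Length] / Math.Max(a.Length, b.Length);
    }
}
```

Passing the metric in as a delegate means each field can be given a different fuzzy algorithm, for example `FieldComparer.Compare(firstName1, firstName2, FieldComparer.LevenshteinSimilarity)`, without changing the exact-match step.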
Does anyone know which fuzzy-match algorithms work better for certain field comparisons than for others?
Upvotes: 2
Views: 1498
Reputation: 859
I just wrote some similar code for entity resolution. The key, though, is that not all fields are created equal. For example, you should not use ASMs on SSN: if even one digit differs, it is a totally different SSN and a different person.
Instead of fuzzy matching address components, I would try to resolve the addresses first and then do an exact match. For example, a good address resolution service will treat "Second Street NW" and "NW 2nd St" as the same street even though they have very poor similarity by all those metrics.
Likewise, you can use Google's phone number parsing library (available for C#, Java, etc.) to format all phone numbers in a standard way and then do direct comparison.
I did use Jaro-Winkler to compare name components, but I did not research several of the metrics you have listed.
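For reference, here is a minimal sketch of Jaro-Winkler in C#. It is not the code I used; the class and method names are made up for illustration, and the prefix weight of 0.1 and the prefix cap of 4 characters are the conventional defaults rather than anything tuned.

```csharp
using System;

public static class JaroWinkler
{
    // Plain Jaro similarity in [0, 1].
    public static double Jaro(string s1, string s2)
    {
        if (s1.Length == 0 && s2.Length == 0) return 1.0;
        if (s1.Length == 0 || s2.Length == 0) return 0.0;

        // Characters count as matching only if they are within this distance of each other.
        int window = Math.Max(0, Math.Max(s1.Length, s2.Length) / 2 - 1);
        var matched1 = new bool[s1.Length];
        var matched2 = new bool[s2.Length];

        int matches = 0;
        for (int i = 0; i < s1.Length; i++)
        {
            int lo = Math.Max(0, i - window);
            int hi = Math.Min(s2.Length - 1, i + window);
            for (int j = lo; j <= hi; j++)
            {
                if (matched2[j] || s1[i] != s2[j]) continue;
                matched1[i] = matched2[j] = true;
                matches++;
                break;
            }
        }
        if (matches == 0) return 0.0;

        // Walk the matched characters of both strings in order and count
        // the positions where they disagree; half of that is the transposition count.
        int outOfOrder = 0, k = 0;
        for (int i = 0; i < s1.Length; i++)
        {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1[i] != s2[k]) outOfOrder++;
            k++;
        }

        double m = matches;
        double t = outOfOrder / 2;   // integer division is intentional
        return (m / s1.Length + m / s2.Length + (m - t) / m) / 3.0;
    }

    // Jaro-Winkler: boosts the Jaro score for strings sharing a common prefix.
    public static double Similarity(string s1, string s2, double prefixWeight = 0.1)
    {
        double jaro = Jaro(s1, s2);
        int maxPrefix = Math.Min(4, Math.Min(s1.Length, s2.Length));
        int prefix = 0;
        while (prefix < maxPrefix && s1[prefix] == s2[prefix]) prefix++;
        return jaro + prefix * prefixWeight * (1.0 - jaro);
    }
}
```

The prefix boost is why it tends to suit name components: typos near the end of a name cost less than a different first letter. The comparison is case-sensitive as written, so normalize case before calling it.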
In short: canonicalize and compare instead of fuzzy match.
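As a rough sketch of what "canonicalize and compare" can look like, assuming nothing beyond the standard library: `Canonicalizer`, `Digits`, and `Street` are hypothetical names, and the two-entry token table is a toy stand-in for a real address resolution service, which would handle far more cases.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Canonicalizer
{
    // Toy lookup table (an assumption for illustration only).
    private static readonly Dictionary<string, string> StreetTokens =
        new Dictionary<string, string>
        {
            ["second"] = "2nd",
            ["street"] = "st",
        };

    // Keeps only ASCII digits, so "123-45-6789" and "123 45 6789" compare equal.
    public static string Digits(string value) =>
        new string(value.Where(c => c >= '0' && c <= '9').ToArray());

    // Lowercases, maps known tokens to one spelling, and sorts the tokens
    // so that word order does not affect the result.
    public static string Street(string value)
    {
        var tokens = value.ToLowerInvariant()
            .Split(new[] { ' ', ',', '.' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(t => StreetTokens.TryGetValue(t, out var canon) ? canon : t)
            .OrderBy(t => t, StringComparer.Ordinal);
        return string.Join(" ", tokens);
    }
}
```

With this, `Canonicalizer.Street("Second Street NW") == Canonicalizer.Street("NW 2nd St")` is an exact match, while two SSNs that differ by a single digit stay different after `Digits`. For phone numbers, the parsing library mentioned above would take the place of `Digits`.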
Upvotes: 2