London guy

Reputation: 28012

Any disambiguation tools/APIs for matching names?

Given two names that have variations in the way they are represented, is there any API/tool/algorithm that can give a score of how similar/different the names are?

For example, Tim O' Reilly is one input and T Reilly is another. The score for this pair should be lower than the score for Tim O' Reilly and Tim Reilly.

I am looking for such score calculation mechanisms. A few challenges that the algorithm should be able to handle:
1) The first names and last names could be swapped when a name is given as input
2) There might be initials in place of names
3) One of the names may not have the last name while the other may have both first name and last name.

... and so on, covering other common errors in name representations.

Upvotes: 1

Views: 185

Answers (2)

zdepablo

Reputation: 452

Two libraries including a handful of distance scores for name similarity are:

No single method covers all the cases that you mention. For 1) and 3), feature and set similarity measures (Jaccard or TF-IDF, for instance) work. For 2), besides Soundex (as mentioned by @houman001), you may consider Levenshtein or Jaro distance. Experiment with some examples from your use case and combine the measures.
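As a rough illustration of combining these ideas, here is a minimal Python sketch using only the standard library. It is not taken from either library: the function names, the 0.8 weight for an initial, and the greedy token pairing are all assumptions made for this example, and `difflib.SequenceMatcher` stands in for Levenshtein or Jaro. Tokens are compared regardless of order (swapped names), an initial gets partial credit when it matches the first letter of a full token, and unmatched tokens lower the score (missing last name).

```python
import re
from difflib import SequenceMatcher

# Assumed weight for an initial that matches the first letter of a full token.
INITIAL_MATCH_SCORE = 0.8


def tokenize(name):
    """Lowercase the name and split it into alphabetic tokens."""
    return re.findall(r"[^\W\d_]+", name.lower())


def token_similarity(a, b):
    """Score two tokens in [0, 1], giving partial credit to initials."""
    if a == b:
        return 1.0
    if len(a) == 1 or len(b) == 1:
        # One side is an initial: it only matches on the first letter.
        return INITIAL_MATCH_SCORE if a[0] == b[0] else 0.0
    # Stand-in for an edit-distance style measure such as Levenshtein or Jaro.
    return SequenceMatcher(None, a, b).ratio()


def name_similarity(name_a, name_b):
    """Order-independent score: greedily pair tokens one-to-one, then
    divide by the longer token count so missing parts cost something."""
    tokens_a, tokens_b = tokenize(name_a), tokenize(name_b)
    if not tokens_a or not tokens_b:
        return 0.0
    shorter, longer = sorted((tokens_a, tokens_b), key=len)
    remaining = list(longer)
    total = 0.0
    for token in shorter:
        best = max(remaining, key=lambda other: token_similarity(token, other))
        total += token_similarity(token, best)
        remaining.remove(best)
    return total / len(longer)


if __name__ == "__main__":
    print(name_similarity("Tim O' Reilly", "T Reilly"))
    print(name_similarity("Tim O' Reilly", "Tim Reilly"))
    print(name_similarity("Reilly Tim", "Tim Reilly"))
```

With these assumed weights, the pair Tim O' Reilly / T Reilly scores below Tim O' Reilly / Tim Reilly, as the question asks. The weights would need tuning against real examples.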

Upvotes: 1

n0rmzzz

Reputation: 3848

For the "API/tool/algorithm that can give a score of how similar/different the names are" part, I can give you a hint:

There are a few heuristic libraries that search engines use, but there is also an encoding called Soundex that computes a short phonetic code (a letter followed by three digits) from a word. Words with the same Soundex code sound alike, even if they are spelled slightly differently. There are some Java implementations around as well.
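To make the idea concrete, here is a minimal Python sketch of the classic American Soundex rules. The function name and table layout are choices made for this example, not taken from any particular library, and real implementations may differ in edge cases.

```python
# Letter groups that share a Soundex digit; vowels, h, w and y have no digit.
_CODES = {}
for _letters, _digit in (
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the four-character Soundex code for a word ('' if no letters)."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        # Skip letters without a digit and repeats of the previous digit.
        if digit and digit != previous:
            code += digit
        # h and w do not separate two letters that share a digit; vowels do.
        if ch not in "hw":
            previous = digit
    # Pad with zeros or truncate to one letter plus three digits.
    return (code + "000")[:4]


if __name__ == "__main__":
    for name in ("Reilly", "Riley", "Robert", "Rupert"):
        print(name, soundex(name))
```

Since Soundex only says whether two codes are equal, it works better as one signal among several than as a score on its own.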

On the points you mentioned later about names, look for contact management libraries/utilities and expect to write some custom code, as these requirements are pretty specific.

Upvotes: 0
