Berry Tsakala

Reputation: 16620

Human name comparison: ways to approach this task

I'm not a Natural Language Processing student, but I know this isn't a trivial strcmp(n1, n2).

Here's what I've learned so far:

I'm not looking for spelling alternatives! The assumption is that the input's spelling is correct.

For example, all the names below can refer to the same person:

I'm trying to:

  1. build (or copy) an algorithm which grades the relationship between two input names
  2. find an indexing method (for names in my database, for hash tables, etc.)

Note: my task isn't about finding names in text, but about comparing two names, e.g.:

name_compare( "James Brown", "Brown, James", "en-US" ) ---> 99.0%

Upvotes: 3

Views: 2698

Answers (5)

Steve Mc

Reputation: 3441

We've been doing this sort of work non-stop lately, and the approach we've taken is to use a look-up table (an alias list). If you can discount misspelled, misheard and non-English names, then the difficult part is taken away. In your examples we would assume that the first word and the last word are the forename and the surname. Anything in between (middle names, initials) would be discarded. Berry and Bernard would be in the alias list, and when Tsakala did not match Berry we would flip the word order around and then get the match.
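A minimal sketch of that approach, in Python. The alias table contents and the function names (`canonical`, `split_name`, `same_person`) are made up for illustration; a real alias list would be much larger and locale-specific, and the sketch assumes each name has at least one word.

```python
# Hypothetical alias table: maps a nickname to a canonical forename.
ALIASES = {
    "berry": "bernard",
    "dick": "richard",
    "steve": "stephen",
}

def canonical(forename):
    """Lower-case a forename and resolve it through the alias table."""
    key = forename.lower().strip(".,")
    return ALIASES.get(key, key)

def split_name(name):
    """Take the first word as the forename and the last word as the surname.
    Anything in between (middle names, initials) is discarded."""
    words = name.replace(",", " ").split()
    return words[0], words[-1]

def same_person(name1, name2):
    f1, s1 = split_name(name1)
    f2, s2 = split_name(name2)
    if canonical(f1) == canonical(f2) and s1.lower() == s2.lower():
        return True
    # No match: flip the word order of the second name and try again.
    return canonical(f1) == canonical(s2) and s1.lower() == f2.lower()
```

With this table, `same_person("Berry Tsakala", "Tsakala, Bernard")` fails on the first comparison and succeeds after the flip.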

One thing you need to understand is the database/people lists you are dealing with. In the English-speaking world middle names are inconsistently recorded, so you can't make or deny a match based on the middle name or middle initial. Soundex will not help you with common name aliases such as "Dick" and "Richard", "Berry" and "Bernard", and possibly "Steve" and "Stephen". In some communities it is quite common for 2 or 3 generations to live at the same address with the same name. The only way you can separate them is by date of birth, which may or may not be recorded. If you have the clout then you should probably make the recording of date of birth mandatory. A lot of "people databases" either don't record dates of birth or won't release them for privacy reasons.

Effectively, matching people's names is not that complicated. It depends entirely on the quality of the data supplied. What happens in practice is that a lot of records remain unmatched, and even a human looking at them can't resolve the mismatch. A human may notice name aliases not recorded in the alias list, or may be able to look up details of the person on the internet, but you can't really expect your program to do that.

Banks, credit rating organisations and the government have a lot of detailed information about us (previous addresses, date of birth, etc.), and that helps them join up names. But for us normal programmers there is no magic bullet.

Upvotes: 1

beak42

Reputation: 736

I had real problems with the Tanimoto approach on UTF-8 input.

What works for languages that use diacritical marks is difflib.SequenceMatcher().
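A short sketch of how that might look, assuming Python 3 (where `str` is already Unicode). The helper name `name_ratio` is made up, and the NFC normalisation step is an extra assumption on my part, added so that precomposed and combining forms of the same accented letter compare equal.

```python
import difflib
import unicodedata

def name_ratio(name1, name2):
    """Similarity in [0, 1] from difflib.SequenceMatcher."""
    # Assumption: normalise to NFC so "é" and "e" + combining accent
    # are treated as the same character before comparing.
    a = unicodedata.normalize("NFC", name1).lower()
    b = unicodedata.normalize("NFC", name2).lower()
    return difflib.SequenceMatcher(None, a, b).ratio()
```

Note that `SequenceMatcher` is order-sensitive, so "Brown, James" and "James Brown" would still need their word order aligned first.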

Upvotes: 0

Jacob

Reputation: 78860

Analyzing name order and the existence of middle names/initials is trivial, of course, so it looks like the real challenge is knowing common name alternatives. I doubt this can be done without using some sort of nickname lookup table. This list is a good starting point. It doesn't map Bernard to Berry, but it would probably catch the most common cases. Perhaps an even more exhaustive list can be found elsewhere, but I definitely think that a locale-specific lookup table is the way to go.

Upvotes: 0

Nick Dandoulakis

Reputation: 43130

I used Tanimoto Coefficient for a quick (but not super) solution, in Python:

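def tanimoto(a, b):
    """
    Tanimoto coefficient.

    Formula:
      Na = number of set A elements
      Nb = number of set B elements
      Nc = number of common items

      T = Nc / (Na + Nb - Nc)
    """
    # a and b are treated as sequences here, so repeated characters count.
    c = [v for v in a if v in b]
    return float(len(c)) / (len(a) + len(b) - len(c))

def name_compare(name1, name2):
    # Compare the two names character by character.
    return tanimoto(name1, name2)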


>>> name_compare("James Brown", "Brown, James")
0.91666666666666663
>>> name_compare("Berry Tsakala", "Bernard Tsakala")
0.75
>>> 
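For comparison, here is a hedged variant that applies the same formula to true sets of character bigrams instead of raw characters. This is not the code used above; `bigrams`, `tanimoto_sets` and `name_compare_bigrams` are illustrative names, and the choice of per-word bigrams is an assumption.

```python
def bigrams(name):
    """Set of lower-cased character bigrams, taken within each word."""
    words = name.lower().replace(",", " ").split()
    return {w[i:i + 2] for w in words for i in range(len(w) - 1)}

def tanimoto_sets(a, b):
    """T = Nc / (Na + Nb - Nc), computed on real sets."""
    common = a & b
    union_size = len(a) + len(b) - len(common)
    return len(common) / union_size if union_size else 0.0

def name_compare_bigrams(name1, name2):
    return tanimoto_sets(bigrams(name1), bigrams(name2))
```

Because bigrams are collected per word, the score ignores word order and commas, so "James Brown" and "Brown, James" produce identical sets. Single-letter initials contribute no bigrams.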

Edit: A link to a good and useful book.

Upvotes: 7

Jacob

Reputation: 78860

Soundex is sometimes used to compare similar names. It doesn't deal with first name/last name ordering, but you could probably just have your code look for the comma to solve that problem.

Upvotes: 1
