Reputation: 2426
I want to compare names that appear in different formats, e.g. "George W. Bush", "George Bush", "George Walker Bush", "Bush, George Walker", "Bush, GW", "Bush, George", etc. There are a few with dots (".") as well, but I omitted those from the list because I will normalize them anyway. In fact, the commas (",") will be stripped as well.
What is the best, most efficient approach to determine whether any two given names actually represent the same person? I have thought of using nameparser and building a comparison algorithm on top of it, but please suggest any other options. An approach using only Python's standard modules would be fine too.
Upvotes: 1
Views: 820
Reputation: 56
There's an open-source library, whoswho, that may be useful as-is, or at least as a base to build more functionality on.
Sample usage:
>>> from whoswho import who
>>> who.match('Bush, G.W.', 'George W. Bush')
Upvotes: 1
Reputation: 6663
The most accurate way of doing this is probably to use an NLP library such as spaCy, which lets you compute similarity scores between words.
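If you want something simpler, you could implement a small normalization function, something like:
def norm(name):
    # Lowercase, drop dots and commas, then sort the words so their order no longer matters
    words = name.lower().replace('.', '').replace(',', ' ').split()
    return ' '.join(sorted(words))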
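Then measure the difference between the two resulting strings, for example with an edit distance or the standard library's difflib.SequenceMatcher.
But this obviously won't give a definitive answer: you get a similarity score, and you still have to pick a threshold for what counts as a match.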
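As a rough illustration of the "normalize, then measure the difference" idea, here is a minimal sketch using only the standard library. It assumes a word-level version of norm (words sorted, dots and commas dropped), and the similarity helper is a hypothetical name introduced just for this example. Treat it as a starting point, not a definitive matcher.

```python
import difflib


def norm(name):
    """Lowercase, drop dots and commas, and sort the words (assumed normalization)."""
    words = name.lower().replace('.', '').replace(',', ' ').split()
    return ' '.join(sorted(words))


def similarity(a, b):
    """Hypothetical helper: ratio in [0, 1], where 1.0 means identical after normalization."""
    return difflib.SequenceMatcher(None, norm(a), norm(b)).ratio()


if __name__ == '__main__':
    for other in ('Bush, George', 'George W. Bush', 'Barack Obama'):
        print(other, round(similarity('George Bush', other), 2))
```

Note that this does not understand initials, so a pair like "Bush, GW" and "George Walker Bush" will score lower than you might want. Handling that needs token-level rules (for example, treating a single letter as matching any word that starts with it), which is where a parser such as nameparser could help.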
Upvotes: 1