Ani
Ani

Reputation: 323

Fixing string variables with varying spellings, etc

I have a dataset with individuals names, addresses, phone numbers, etc. Some individuals appear multiple times, with slightly varying names/ and/or addressees and/or phone numbers. A snippet of the fake data is shown below:

first       last    address         phone    
Jimmy       Bamboo  P.O. Box 1190   xxx-xx-xx00
Jimmy W.    Bamboo  P.O. Box 1190   xxx-xx-xx22
James West  Bamboo  P.O. Box 219    xxx-66-xxxx

... and so on. Some times E. is spelled out as east, St. as Street, at other times they are not.

What I need to do is run through almost 120,000 rows of data to identify each unique individual based on their names, addresses, and phone numbers. Anyone have a clue as to how this might be done without manually running through each record, one at a time? The more I stare at it the more I think its impossible without making some judgment calls and saying if at least two or three fields are the same treat this as a single individual.

thanks!!

Ani

Upvotes: 0

Views: 840

Answers (2)

As I mentioned in the comments, this is not trivial. You have to decide the trade-off of programmer time/solution complexity with results. You will not achieve 100% results. You can only approach it, and the time and complexity cost will increase the closer to 100% you get. Start with an easy solution (exact matches), and see what issue most commonly causes the missed matches. Implement a fuzzy solution to address that. Rinse and repeat.

There are several tools you can use (we use them all).

1) distance matching, like Damerau Levenshtein . you can use this for names, addresses and other things. It handles error like transpositions, minor spelling, omitted characters, etc.

2) phonetic word matching - soundex is not good. There are other more advanced ones. We ended up writing our own to handle the mix of ethnicities we commonly encounter.

3) nickname lookups - many nicknames will not get caught by either phonetic or distance matching - names like Fanny for Frances. There are many nicknames like that. You can build a lookup of nicknames to regular name. Consider though the variations like Jennifer -> Jen, Jenny, Jennie, Jenee, etc.

Names can be tough. Creative spelling of names seems to be a current fad. For instance, our database has over 30 spelling variations of the name Kaitlynn, and they are all spellings of actual names. This makes nickname matching tough when you're trying to match Katy to any of those.

Here are some other answers on similar topics I've made here on stackoverflow:

Processing of mongolian names

How to solve Dilemma of storing human names in MySQL and keep both discriminability and a search for similar names?

MySQL Mixing Damerau–Levenshtein Fuzzy with Like Wildcard

Upvotes: 3

Ari B. Friedman
Ari B. Friedman

Reputation: 72769

You can calculate the pairwise matrix of Levenshtein distances.

See this recent post for more info: http://www.markvanderloo.eu/yaRb/2013/02/26/the-stringdist-package/

Upvotes: 0

Related Questions