Reputation: 34703
I've got a database with property owners; I would like to count the number of properties owned by each person, but am running into standard mismatch problems:
REDEVELOPMENT AUTHORITY vs. REDEVELOPMENT AUTHORITY O vs. PHILADELPHIA REDEVELOPMEN vs. PHILA. REDEVELOPMENT AUTH
COMMONWEALTH OF PENNA vs. COMMONWEALTH OF PENNSYLVA vs. COMMONWEALTH OF PA
TRS UNIV OF PENN vs. TRUSTEES OF THE UNIVERSIT
From what I've seen, this is a pretty common problem, but my problem differs from those with solutions I've seen for two reasons:
1) I've got a large number of strings (~570,000), so computing the 570000 x 570000 matrix of edit distances (or other pairwise match metrics) seems like a daunting use of resources
2) I'm not focused on one-off comparisons--e.g., as is most common for what I've seen from big data fuzzy matching questions, matching user input to a database on file. I have one fixed data set that I want to condense once and for all.
Are there any well-established routines for such an exercise? I'm most familiar with Python and R, so an approach in either of those would be ideal, but since I only need to do this once, I'm open to branching out to other, less familiar languages (perhaps something in SQL?) for this particular task.
Upvotes: 0
Views: 1077
Reputation: 453
You can also use agrep() in R for fuzzy name matching, by giving a percentage of allowed mismatches. If you pass it a fixed dataset, then you can grep for matches out of your database.
Upvotes: 1
Reputation: 5169
That is exactly what I am facing at my new job daily (but lines counts are few million). My approach is to:
1) find a set of unique strings by using p = unique(a)
2) remove punctuation, split strings in p
by whitespaces, make a table of words' frequencies, create a set of rules and use gsub
to "recover" abbreviations, mistyped words, etc. E.g. in your case "AUTH" should be recovered back to "AUTHORITY", "UNIV" -> "UNIVERSITY" (or vice versa)
3) recover typos if I spot them by eye
4) advanced: reorder words in strings (to often an improper English) to see if the two or more strings are identical albeit word order (e.g. "10pack 10oz" and "10oz 10pack").
Upvotes: 1