Radoslaw
Radoslaw

Reputation: 243

Smart string comparison

I am looking for a library/class that allows smart compare of two strings. At best it would give as a result percent of how two strings are alike. I am comparing company names, addresses that are recordered in different repositories, thus having many misspellings or inconsistencies in names.

Sample strings to compare:

 "Good Company Ltd." vs. "GoodCompany" 
 "Baker Street 2" vs. "Baker Str. 2" 

If I get a result in percentage of alikeness, than this can be an input for smart merge of such data.

Do you know any good libraries that would allow such smart string compare?

Upvotes: 15

Views: 3850

Answers (3)

Artemix
Artemix

Reputation: 2171

You might want to look for Levenshtein Distance implementation. It shows how many characters insertions/deletions and substitutions are required to make two strings equal.

Here is a post about libraries in C# that implement Levenshtein Distance and other text-comparison algorithms: .NET library for text algorithms?.

However I think you'll have to use some combination of methods, because using Levenshtein will tell you that 'Good Company Ltd.' is more similar to 'Bad Company Ltd.' than to 'GoodCompany'.

Maybe you'll have to do some preprocessing by expanding 'str.' to 'street' and removing 'Ltd.' as 'meaningless' word in terms of string comparison.

UPDATE 1

Francesco De Lisi suggests to use Phonetic algoritms. Looks like they are better suited for comparing misspelled names. Still you'll need to split addresses to phonetic / non-phonetic parts (like building numbers) and compare them separately.

UPDATE 2

As for addresses comparison, this post suggests to use Google Maps API for this purpose and another post discusses address parsing. I guess that Google can produce reliable results because they have a database of street addresses where they can find the most correct street name spelling. Without list of correct street/company names you may encounter some strange name that is incorrect, however many different correct names would be similar to it.

Upvotes: 14

Francesco De Lisi
Francesco De Lisi

Reputation: 1523

Levenshtein is not appropriate in this case. "Good Company Ltd" and "GoodCompany" if trimmed have a distance = 3 while "Good Company Ltd" and "Food Company Ltd" have a distance of 1, but totally a different meaning. I suggest Metaphone or Double Metaphone algorithm.

Using online metaphone comparer the results are:

Good Company Ltd = KTKMPNLTT
GoodCompany = KTKMPN
Food Company Ltd = FTKMPNLTT
GoodCompanyLLC = KTKMPNLK

In this way you know that GoodCompany, Good Company Ltd and GoodCompanyLLC are similar, while Food Company is misspelled or totally not related (KTKMPN is contained both in KTKMPNLTT and KTKMPNLK but not in FTKMPNLTT).

Look here for other algorithms comparisons.

Upvotes: 16

Anton Gogolev
Anton Gogolev

Reputation: 115769

What you're looking for is a Levenshtein distance (Wikipedia):

...the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other

Upvotes: 8

Related Questions