String similar to a set of strings

I need to compare a set of strings to another set of strings and find which strings are similar (fuzzy-string matching). For example:

{ "A.B. Mann Incorporated", "Mr. Enrique Bellini", "Park Management Systems" } 
and
{ "Park", "AB Mann Inc.", "E. Bellini" }

Assuming a zero-based index, the matches would be 0-1, 1-2, 2-0. Obviously, no algorithm can be perfect at this type of thing.

I have a working implementation of the Levenshtein-distance algorithm, but using it to find similar strings from each set necessitates looping through both sets of strings to do the comparison, resulting in an O(n^2) algorithm. This runs unacceptably slow even with modestly sized sets.

I've also tried a clustering algorithm that uses shingling and the Jaccard coefficient. Unfortunately, this too runs in O(n^2), which ends up being too slow, even with bit-level optimizations.

Does anyone know of a more efficient algorithm (faster than O(n^2)), or better yet, a library already written in C#, for accomplishing this?

Upvotes: 2

Answers (3)

Chen Li

Reputation: 86

This problem, called "string similarity join," has been studied a lot recently in the research community. We released a source code package in C++ called Flamingo that implements such an algorithm http://flamingo.ics.uci.edu/releases/4.1/src/partenum/. We also have a Hadoop-based implementation at http://asterix.ics.uci.edu/fuzzyjoin/ if your data set is too large for a single machine.

Upvotes: 0

paparazzo

Reputation: 45096

Not a direct answer to the O(N^2) but a comment on the N1 algorithm.

That is sample data but it is all clean. That is not data that I would use Levenstien on. Incriminate would have closer distance to Incorporated than Inc. E. would not match well to Enrique.

Levenshtein-distance is good at catching key entry errors.
It is also good for matching OCR.

If you have clean data I would go with stemming and other custom rules.
Porter stemmer is available for C# and if you have clean data
E.G.
remove . and other punctuation
remove stop words (the)
stem
parse each list once and assign an int value for each unique stem
do the match on int
still N^2 but now N1 is faster
you might add in a single cap the matches a word that start with cap gets a partial score
also need to account for number of words
two groups of 5 that match of 3 should score higher then two groups of 10 that match on 4

I would create Int hashsets for each phrase and then intersect and count.

Not sure you can get out of N^2.
But I am suggesting you look at N1.

Lucene is a library with phrase matching but it is not really set up for batches.
Create the index with the intent it is used many time so index search speed is optimized over index creation time.

Upvotes: 1

Olivier Jacot-Descombes

Reputation: 112259

In the given examples at least one word is always matching. A possible approach could use a multimap (a dictionary being able to store multiple entries per key) or a Dictionary<TKey,List<TVlaue>>. Each string from the first set would be splitted into single words. These words would be used as key in the multimap and the whole string would be stored as value.

Now you can split strings from the second set into single words and do an O(1) lookup for each word, i.e. an O(N) lookup for all the words. This yields a first raw result, where each match contains at least one matching word. Finally you would have to refine this raw result by applying other rules (like searching for initials or abbreviated words).

Upvotes: 0

String similar to a set of strings

Answers (3)

Related Questions