Matching strings using Levenshtein distance and heuristics

Question

I have string patterns ('rules') grouped into 'categories', e.g.:

Category1

lorem ipsum dolor sit amet
consectetur adipiscing elit
fusce sit amet ante nisi
lorem ut sem interdum molestie
suspendisse non lorem ut sem interdum molestie

Category2

vivamus porta non metus egestas finibus
nam convallis augue nec laoreet pretium
turpis velit cursus enim ac suscipit risus turpis in metus

Now I want to 'categorize' a string based on those rules. Say we want to find out which category the string fusce laoreet amet ante nisi belongs to. My current implementation uses Levenshtein distance and finds that the string most closely resembles fusce sit amet ante nisi, so the category is Category1.

Now say we want to categorize vivamus vel lorem imperdiet sit. Because I put a threshold of 1/5th of the string length on the Levenshtein distance (i.e. the string must be at least 80% similar to its match), no rule is close enough and the string remains 'uncategorized'.
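To make the first step concrete, here is a minimal sketch of nearest-rule matching with a distance threshold. The names levenshtein, closest_rule, CATEGORIES and max_ratio are made up for illustration, and measuring the 1/5 limit against the length of the input string (rather than the rule, or the longer of the two) is an assumption, since the description above does not pin that down.

```python
# Rules from the question, grouped by category.
CATEGORIES = {
    "Category1": [
        "lorem ipsum dolor sit amet",
        "consectetur adipiscing elit",
        "fusce sit amet ante nisi",
        "lorem ut sem interdum molestie",
        "suspendisse non lorem ut sem interdum molestie",
    ],
    "Category2": [
        "vivamus porta non metus egestas finibus",
        "nam convallis augue nec laoreet pretium",
        "turpis velit cursus enim ac suscipit risus turpis in metus",
    ],
}


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def closest_rule(text, categories, max_ratio=0.2):
    """Return (category, rule, distance) for the nearest rule, or None
    if even the nearest rule is further than max_ratio * len(text).

    Assumption: the limit is relative to the length of the input text.
    """
    best = None
    for category, rules in categories.items():
        for rule in rules:
            distance = levenshtein(text, rule)
            if best is None or distance < best[2]:
                best = (category, rule, distance)
    if best is not None and best[2] <= max_ratio * len(text):
        return best
    return None


print(closest_rule("vivamus vel lorem imperdiet sit", CATEGORIES))
```

One thing the sketch makes visible: fusce laoreet amet ante nisi is 28 characters long and 6 edits away from fusce sit amet ante nisi, while 28/5 = 5.6, so that example sits right at the edge, and whether it passes depends on which length the limit is taken from and how it is rounded.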

In such a case I would continue with the following algorithm:

From each category, I extract the 'common words', i.e. words which repeat across the rules within the category. In a way, those are the dominating words of the category. So we'll have:

Category1

lorem: 3
sit: 2
amet: 2
sem: 2
interdum: 2
molestie: 2

Category2

metus: 2
turpis: 2
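The counts above can be reproduced with a short sketch like the following. The function name dominating_words and its parameters are hypothetical. Two assumptions are baked in: words are counted by total occurrences within a category (which is how turpis reaches 2 from a single rule), and words shorter than three characters are skipped, because a plain count would also report ut: 2 for Category1, which the list above leaves out.

```python
from collections import Counter

# Rules from the question, grouped by category.
CATEGORIES = {
    "Category1": [
        "lorem ipsum dolor sit amet",
        "consectetur adipiscing elit",
        "fusce sit amet ante nisi",
        "lorem ut sem interdum molestie",
        "suspendisse non lorem ut sem interdum molestie",
    ],
    "Category2": [
        "vivamus porta non metus egestas finibus",
        "nam convallis augue nec laoreet pretium",
        "turpis velit cursus enim ac suscipit risus turpis in metus",
    ],
}


def dominating_words(rules, min_count=2, min_length=3):
    """Map each repeated word in a category to its occurrence count.

    min_count and min_length are assumed cut-offs, not part of the
    original description.
    """
    counts = Counter(word for rule in rules for word in rule.split())
    return {
        word: count
        for word, count in counts.items()
        if count >= min_count and len(word) >= min_length
    }


for name, rules in CATEGORIES.items():
    print(name, dominating_words(rules))
```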

Now I split the string vivamus vel lorem imperdiet sit into words and give each category a value, depending on which of those words appear among the category's 'dominating words', i.e.:

Category1 gets a value of 5, i.e. 3 (lorem) + 2 (sit), and Category2 gets a value of 0 (no matches between the words of the string I am categorizing and the dominating words of the category). The highest-value category 'wins'.
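As a rough sketch of this scoring step, with the dominating-word counts hard-coded from the lists above: the names DOMINATING, score_categories and best_guess are illustrative only, and returning None when every category scores 0 is an assumption, since the description does not say what should happen in that case.

```python
# Dominating words and their counts, copied from the lists above.
DOMINATING = {
    "Category1": {"lorem": 3, "sit": 2, "amet": 2,
                  "sem": 2, "interdum": 2, "molestie": 2},
    "Category2": {"metus": 2, "turpis": 2},
}


def score_categories(text, dominating):
    """Sum, per category, the counts of the input words that are
    dominating words in that category."""
    words = text.split()
    return {
        category: sum(weights.get(word, 0) for word in words)
        for category, weights in dominating.items()
    }


def best_guess(text, dominating):
    """Return the highest-scoring category, or None if nothing matched
    (assumed behaviour for the all-zero case)."""
    scores = score_categories(text, dominating)
    category, score = max(scores.items(), key=lambda item: item[1])
    return category if score > 0 else None


print(score_categories("vivamus vel lorem imperdiet sit", DOMINATING))
# {'Category1': 5, 'Category2': 0}
print(best_guess("vivamus vel lorem imperdiet sit", DOMINATING))
# Category1
```

Note that raw sums favour categories with more rules and longer inputs; dividing by the number of rules or words would be one way to compensate, if that turns out to matter.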

In short, my algorithm is:

1. Use Levenshtein distance, allowing at most 1/5th of the string to change, to find the closest matching rule.
2. If that fails, split the string we are categorizing into words and, for each word, check how 'dominating' it is in each category, adding up a value per category. The highest-value category is our best guess.

Is there a better way to do this? Do you see a problem with this algorithm? Any suggestions?
