Mike D
Mike D

Reputation: 207

Improved Levenshtein Algorithm

I recently implemented the levenshtein algorithm into our search engine database, but we have come across a problem.

According to the basic levenshtein

Levenshtein('123456','12x456') is the same value as Levenshtein('123456', '12345x')

Normally this is fine, but for my specific problem that is incorrect. When someone uses our website, this is incorrect. Manufacturers of electronic components often make similar products with only a difference in the very last letter. If the first letter is different, it's usually an entirely different category. Therefore I need an algorithm that considers matches near the beginning of the word more valuable than those in the back or in other words, mismatches that occur near the beginning should apply a larger penalty than those in the back.

If anyone has any idea please let me know.

Upvotes: 9

Views: 2823

Answers (3)

Pierre
Pierre

Reputation: 35306

See the Smith-Waterman algorithm widely used in bioinformatics. It can perform a local alignment of your query, but that will be slower that Levenshtein.

Upvotes: 2

Will Castillo
Will Castillo

Reputation: 21

Use the Jaro-Winkler Distance... It's exactly what you are asking for.

Upvotes: 2

Rob Di Marco
Rob Di Marco

Reputation: 44992

We had a similar issue and used the Brew edit distance method

We were in Perl so we used the Text::Brew library. My co-worker did a nice presentation on using a few different algorithms, including Brew.

Upvotes: 1

Related Questions