Reputation: 477
I'm using the Levenshtein algorithm to find the similarity between two strings. This is a very important part of the program I'm making, so it needs to be effective. The problem is that the algorithm doesn't find the following examples as similar:
CONAIR
AIRCON
The algorithm will give a distance of 6. So for this word of 6 letters (You look at the word with the highest amount of letters), the difference is of 100% => the similarity is 0%.
I need to find a way to find the similarities between two string, but also taking into consideration cases like the one I presented before.
Is there a better algorithm I can use? Or what do you guys recommend me?
EDIT: I've also looked into the "Damerau–Levenshtein" algorithm, which adds transpositions. The problem is that this transpositions are only for adjacent characters (and not for a number of characters).
Upvotes: 29
Views: 13183
Reputation: 113
try using other similarity measures like sorenson,jaccard and jaro_winkler
personally i am huge fan of jaro winkler since it served my purpose many times.
from Levenshtein import jaro_winkler
In [2]: jaro_winkler("conair","aircon")
Out[2]: 0.8333333333333334
Upvotes: 2
Reputation: 311
Take a look to Needleman-Wunsch, or Smith-Waterman algorithms. They are used to handle string matching via adapted edit-distance for DNA sequences, where any kind of insertions, reversals, transposons may happen, to any length, at any place. Saying this, I need to add that for a sufficiently long string there is no optimal solution. And don't forget that the edit cost depend on the use-context of the algorithm (a semantics issue), while any algorithm is always a syntactical machine.
Upvotes: 1
Reputation: 62106
I think this can be easily solved by employing the Longest Common Substring/Subsequence algorithm on one of the strings (e.g. "conair") and the other string appended to itself once (e.g. "aircon" -> "airconaircon").
Sample code in C:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
// Returns the length of the longest common substring (LCS)
// between two given strings.
//
// This recursive implementation can be replaced by a
// more performant dynamic programming implementation.
size_t llcs(const char* s1, const char* s2)
{
size_t len[3];
if (*s1 == '\0' || *s2 == '\0') return 0;
len[0] = (*s1 == *s2) + llcs(s1 + 1, s2 + 1);
len[1] = llcs(s1 + 1, s2);
len[2] = llcs(s1, s2 + 1);
if (len[0] < len[1]) len[0] = len[1];
if (len[0] < len[2]) len[0] = len[2];
return len[0];
}
// Returns similarity of two given strings in the range
// from 0.0 to 1.0 (1.0 for equal strings).
double similarity(const char* s1, const char* s2)
{
size_t s1len = strlen(s1);
size_t s2len = strlen(s2);
double sim;
if (s1len == 0 && s2len == 0)
{
// Two empty strings are equal
sim = 1;
}
else
{
size_t len;
// Append s1 to itself in s1s1 (e.g. "aircon" -> "airconaircon")
char* s1s1 = malloc(s1len * 2 + 1);
strcpy(s1s1, s1);
strcpy(s1s1 + s1len, s1);
// Find the length of the LCS between s1s1 and s2
// (e.g. between "airconaircon" and "conair")
len = llcs(s1s1, s2);
// We need it not longer than s1 (e.g. "aircon")
// since we're actually comparing s1 and s2
if (len > s1len) len = s1len;
len *= 2;
// Prevent 100% similarity between a string and its
// cyclically shifted version (e.g. "aircon" and "conair")
if (len == s1len + s2len && strcmp(s1, s2) != 0) len--;
// Get the final measure of the similarity
sim = (double)len / (s1len + s2len);
free(s1s1);
}
return sim;
}
int main(int argc, char** argv)
{
if (argc == 3)
printf("Similarity of \"%s\" and \"%s\" is %.2f%%\n",
argv[1], argv[2], 100 * similarity(argv[1], argv[2]));
else
printf("Usage:\n %s string1 string2\n",
argv[0]);
return 0;
}
Sample output:
Similarity of "123" and "123" is 100.00%
Similarity of "123" and "1234" is 85.71%
Similarity of "0123" and "123" is 85.71%
Similarity of "a" and "aa" is 66.67%
Similarity of "aa" and "a" is 66.67%
Similarity of "aaaaaaa" and "aaaaaa" is 92.31%
Similarity of "aaaaaa" and "aaaaaaa" is 92.31%
Similarity of "aircon" and "conair" is 91.67%
Similarity of "spit" and "pits" is 87.50%
Similarity of "pits" and "spit" is 87.50%
Similarity of "spits" and "pits" is 88.89%
Similarity of "pits" and "spits" is 88.89%
Upvotes: 7
Reputation: 1973
Sorting the words and finding Levenshtein would give a 100% match for your example but it would also give a 100% match for, for e.g.
CONAIR
RCIAON
which might not be what you want.
The other way to define similarity would be to find out common substrings for 2 strings. You can create a Suffix Tree and find out all common substrings and try to determine how similar they are. So for your e.g. the suffix tree would give common substrings as CON & AIR which covers the whole word (for your 2 strings) and thus concluding them as similar.
Upvotes: 2
Reputation: 56941
Theoretically, the approach you are using is correct for the problem that you are trying to solve. But Levenstein would consider only the individual characters the two sets.
The string similarity could also be found using Longest Common Subsequence method and then you can see Levenstein on rest of the unmatched.
In case you want to do a clustered approach, the following answer seems to have some details, but obviously it is more difficult to implement.
Upvotes: 2
Reputation: 5993
It sounds like you may want to try doing Levenshtein distance using syllables or phonemes instead of letters.
Upvotes: 2
Reputation: 7307
I would divide the term into unigrams, bigrams and trigrams, then calculate cosine similarity.
Upvotes: 11