Fede Lerner

Reputation: 477

String similarity -> Levenshtein distance

I'm using the Levenshtein algorithm to find the similarity between two strings. This is a very important part of the program I'm making, so it needs to be effective. The problem is that the algorithm doesn't find the following examples as similar:

CONAIR
AIRCON

The algorithm gives a distance of 6. Normalizing by the longer of the two words (6 letters here), the difference is 100% => the similarity is 0%.

I need a way to measure the similarity between two strings that also takes into consideration cases like the one I presented above.

Is there a better algorithm I can use? What would you guys recommend?

EDIT: I've also looked into the "Damerau–Levenshtein" algorithm, which adds transpositions. The problem is that these transpositions only cover adjacent characters, not blocks of several characters.
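For reference, the quoted distance of 6 comes straight out of the textbook dynamic-programming recurrence (a sketch, not my production code):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert/delete/substitute),
    # kept to two rows of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

d = levenshtein("CONAIR", "AIRCON")
print(d)                       # 6
print(1 - d / 6)               # normalized similarity: 0.0
```

Every aligned character pair differs, so the distance equals the full word length and the normalized similarity collapses to zero.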

Upvotes: 29

Views: 13183

Answers (7)

Yaswanth

Reputation: 113

Try other similarity measures like Sørensen–Dice, Jaccard, and Jaro–Winkler.

Personally, I am a huge fan of Jaro–Winkler, since it has served my purpose many times.

>>> from Levenshtein import jaro_winkler
>>> jaro_winkler("conair", "aircon")
0.8333333333333334

Upvotes: 2

monnoo

Reputation: 311

Take a look at the Needleman–Wunsch or Smith–Waterman algorithms. They handle string matching via an adapted edit distance for DNA sequences, where any kind of insertions, reversals, and transpositions may happen, of any length and at any place. That said, for sufficiently long strings there is no optimal solution. And don't forget that the edit costs depend on the use context of the algorithm (a semantics issue), while any algorithm is always a syntactical machine.
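A minimal sketch of the Smith–Waterman recurrence (with assumed scores: match +2, mismatch -1, gap -1; real applications tune these to their context) shows how local alignment rewards the shared blocks that plain edit distance misses:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    # Local alignment: cell scores never drop below 0, so the
    # best-matching region is found wherever it sits in the strings.
    h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best

print(smith_waterman("CONAIR", "AIRCON"))  # 6: the "CON" (or "AIR") block
```

With these scores the best local alignment is the three-letter block "CON" (or "AIR"), scoring 3 × 2 = 6, even though the global edit distance between the full strings is maximal.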

Upvotes: 1

Alexey Frunze

Reputation: 62106

I think this can be easily solved by employing the Longest Common Substring/Subsequence algorithm on one of the strings (e.g. "conair") and the other string appended to itself once (e.g. "aircon" -> "airconaircon").

Sample code in C:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

// Returns the length of the longest common subsequence (LCS)
// of two given strings.
//
// This recursive implementation can be replaced by a
// more performant dynamic programming implementation.
size_t llcs(const char* s1, const char* s2)
{
  size_t len[3];

  if (*s1 == '\0' || *s2 == '\0') return 0;

  len[0] = (*s1 == *s2) + llcs(s1 + 1, s2 + 1);
  len[1] = llcs(s1 + 1, s2);
  len[2] = llcs(s1, s2 + 1);

  if (len[0] < len[1]) len[0] = len[1];
  if (len[0] < len[2]) len[0] = len[2];

  return len[0];
}

// Returns similarity of two given strings in the range
// from 0.0 to 1.0 (1.0 for equal strings).
double similarity(const char* s1, const char* s2)
{
  size_t s1len = strlen(s1);
  size_t s2len = strlen(s2);
  double sim;

  if (s1len == 0 && s2len == 0)
  {
    // Two empty strings are equal
    sim = 1;
  }
  else
  {
    size_t len;
    // Append s1 to itself in s1s1 (e.g. "aircon" -> "airconaircon")
    char* s1s1 = malloc(s1len * 2 + 1);
    strcpy(s1s1, s1);
    strcpy(s1s1 + s1len, s1);

    // Find the length of the LCS between s1s1 and s2
    // (e.g. between "airconaircon" and "conair")
    len = llcs(s1s1, s2);
    // We need it not longer than s1 (e.g. "aircon")
    // since we're actually comparing s1 and s2
    if (len > s1len) len = s1len;

    len *= 2;

    // Prevent 100% similarity between a string and its
    // cyclically shifted version (e.g. "aircon" and "conair")
    if (len == s1len + s2len && strcmp(s1, s2) != 0) len--;

    // Get the final measure of the similarity
    sim = (double)len / (s1len + s2len);

    free(s1s1);
  }

  return sim;
}

int main(int argc, char** argv)
{
  if (argc == 3)
    printf("Similarity of \"%s\" and \"%s\" is %.2f%%\n",
           argv[1], argv[2], 100 * similarity(argv[1], argv[2]));
  else
    printf("Usage:\n  %s string1 string2\n",
           argv[0]);
  return 0;
}

Sample output:

Similarity of "123" and "123" is 100.00%
Similarity of "123" and "1234" is 85.71%
Similarity of "0123" and "123" is 85.71%
Similarity of "a" and "aa" is 66.67%
Similarity of "aa" and "a" is 66.67%
Similarity of "aaaaaaa" and "aaaaaa" is 92.31%
Similarity of "aaaaaa" and "aaaaaaa" is 92.31%
Similarity of "aircon" and "conair" is 91.67%
Similarity of "spit" and "pits" is 87.50%
Similarity of "pits" and "spit" is 87.50%
Similarity of "spits" and "pits" is 88.89%
Similarity of "pits" and "spits" is 88.89%

Upvotes: 7

user1168577

Reputation: 1973

Sorting the letters of each word and then computing Levenshtein would give a 100% match for your example, but it would also give a 100% match for, e.g.,

CONAIR
RCIAON

which might not be what you want.
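The pitfall is easy to see: sorting the letters collapses every anagram, including random shuffles, onto the same key, so the subsequent Levenshtein step sees identical strings. A quick sketch:

```python
# Sorting letters makes any permutation of the same letters identical,
# so sort-then-Levenshtein reports distance 0 (a "100% match") even
# for a meaningless shuffle like "RCIAON".
def letter_key(s):
    return "".join(sorted(s))

print(letter_key("CONAIR"))  # ACINOR
print(letter_key("RCIAON"))  # ACINOR
```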

The other way to define similarity would be to find the common substrings of the two strings. You can build a Suffix Tree, find all common substrings, and try to determine from them how similar the strings are. For your example, the suffix tree gives the common substrings CON and AIR, which together cover both words, so you can conclude the strings are similar.
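A suffix tree is the efficient way to do this; for intuition, the same maximal common substrings can be found with a brute-force sketch (fine for short strings, quadratic in space):

```python
def maximal_common_substrings(a, b, min_len=2):
    # Brute force: intersect all substrings of a and b, then keep
    # only "maximal" ones (not contained in a longer common substring).
    def subs(s):
        return {s[i:j] for i in range(len(s))
                for j in range(i + min_len, len(s) + 1)}
    common = subs(a) & subs(b)
    return {s for s in common
            if not any(s != t and s in t for t in common)}

print(sorted(maximal_common_substrings("CONAIR", "AIRCON")))  # ['AIR', 'CON']
```

The two blocks AIR and CON cover both words completely, which is exactly the signal the suffix-tree approach exploits.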

Upvotes: 2

Senthil Kumaran

Reputation: 56941

Theoretically, the approach you are using is correct for the problem you are trying to solve. But Levenshtein considers only the individual characters of the two strings.

String similarity could also be measured with the Longest Common Subsequence method, and you can then apply Levenshtein to the remaining unmatched parts.
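As a sketch of the LCS-based direction (not the exact combination proposed here), Python's standard-library difflib already exposes a longest-matching-blocks similarity:

```python
from difflib import SequenceMatcher

# ratio() is 2*M / (len(a) + len(b)), where M counts the characters
# covered by the matching blocks -- here the single block "CON".
sm = SequenceMatcher(None, "CONAIR", "AIRCON")
print(sm.ratio())                 # 0.5
print(sm.get_matching_blocks())   # the matched "CON" block plus sentinel
```

The unmatched leftovers ("AIR" on each side, in crossed positions) are what a follow-up Levenshtein pass would then compare.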

In case you want to do a clustered approach, the following answer seems to have some details, but obviously it is more difficult to implement.

Upvotes: 2

Sam Mussmann

Reputation: 5993

It sounds like you may want to try doing Levenshtein distance using syllables or phonemes instead of letters.
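This works because Levenshtein is defined over any sequence of comparable tokens, not just characters. A sketch with a hand-made syllable split (the tokenisation here is a hypothetical example; a real system would use a syllabifier or a grapheme-to-phoneme model):

```python
def seq_levenshtein(a, b):
    # Same dynamic-programming recurrence as the character version,
    # but over arbitrary token sequences.
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ta != tb)))
        prev = cur
    return prev[-1]

print(seq_levenshtein(list("CONAIR"), list("AIRCON")))  # 6 over letters
print(seq_levenshtein(["CON", "AIR"], ["AIR", "CON"]))  # 2 over syllables
```

Over syllables the distance drops from 6 edits to 2, and a transposition-aware variant on the same tokens would bring it down to a single edit.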

Upvotes: 2

maniek

Reputation: 7307

I would split each term into unigrams, bigrams, and trigrams, then calculate cosine similarity between the resulting n-gram vectors.
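For instance, with bigrams alone (a sketch; the full suggestion would also mix in unigram and trigram counts):

```python
from collections import Counter
from math import sqrt

def ngrams(s, n=2):
    # Count the overlapping character n-grams of s.
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine(a, b, n=2):
    va, vb = ngrams(a, n), ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va)
    norm = (sqrt(sum(c * c for c in va.values()))
            * sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

print(cosine("conair", "aircon"))  # 0.8: 4 shared bigrams out of 5 each
```

"conair" and "aircon" share the bigrams co, on, ai, and ir out of five each, so despite a maximal edit distance the cosine similarity comes out at 0.8.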

Upvotes: 11
