Hristo

Reputation: 46477

what is a good metric for deciding if 2 Strings are "similar enough"

I'm working on a very rough, first-draft algorithm to determine how similar 2 Strings are. I'm using Levenshtein distance to calculate the edit distance between the Strings.

What I'm doing currently is basically taking the total number of edits and dividing it by the size of the larger String. If that value is below some threshold, currently set (arbitrarily) to 25%, then they are "similar enough".

However, this is totally arbitrary and I don't think it is a very good way to calculate similarity. Is there some kind of math equation or probability/statistics approach that takes the Levenshtein distance data and uses it to say "yes, these strings are similar enough based on the number of edits made and the size of the strings"?

Also, the key thing here is that I'm using an arbitrary threshold and I would prefer not to. How can I compute this threshold instead of assigning it, so that I can safely say that 2 Strings are "similar enough"?

UPDATE

I'm comparing strings that represent a Java stack trace. The reason I want to do this is to group a bunch of given stack traces by similarity and use it as a filter to sort "stuff" :) This grouping is important for a higher level reason which I can't exactly share publicly.


So far, my algorithm (pseudo code) is roughly along the lines of:

/*
 * The input lists represent the Strings I want to test for similarity. The
 * Strings are split apart based on new lines / carriage returns because Java
 * stack traces are not a giant one-line String, rather a multi-line String.
 * So each element in the input lists is a "line" from its stack trace.
 */
calculate similarity (List<String> list1, List<String> list2) {

    length1 = 0;
    length2 = 0;
    levenshteinDistance = 0;

    iterator1 = list1.iterator();
    iterator2 = list2.iterator();

    while ( iterator1.hasNext() && iterator2.hasNext() ) {

        str1 = iterator1.next();
        str2 = iterator2.next();

        // skip blank/empty lines because they are not interesting
        if ( str1.trim().isEmpty() && str2.trim().isEmpty() ) {
            continue;
        }

        length1 += str1.length();
        length2 += str2.length();

        levenshteinDistance += getLevenshteinDistance(str1, str2);
    }

    // lines left over in the longer trace have no counterpart,
    // so every one of their characters counts as an edit
    while ( iterator1.hasNext() ) {
        str1 = iterator1.next();
        length1 += str1.length();
        levenshteinDistance += str1.length();
    }
    while ( iterator2.hasNext() ) {
        str2 = iterator2.next();
        length2 += str2.length();
        levenshteinDistance += str2.length();
    }

    // cast to double, otherwise integer division truncates the result to 0
    difference = (double) levenshteinDistance / Math.max(length1, length2);

    return difference < 0.25; // <- arbitrary threshold, yuck!
}
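For reference, here is a self-contained Java sketch of the same idea, with a hand-rolled dynamic-programming Levenshtein method standing in for the getLevenshteinDistance(...) call above (the class and method names here are mine, not any library's API):

```java
import java.util.List;

public class TraceSimilarity {

    // Classic two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Normalized line-by-line distance: 0.0 = identical, 1.0 = nothing shared.
    // Missing lines in the shorter trace are treated as empty strings, so all
    // of their counterpart's characters count as edits.
    static double difference(List<String> trace1, List<String> trace2) {
        int length1 = 0, length2 = 0, distance = 0;
        int lines = Math.max(trace1.size(), trace2.size());
        for (int i = 0; i < lines; i++) {
            String s1 = i < trace1.size() ? trace1.get(i) : "";
            String s2 = i < trace2.size() ? trace2.get(i) : "";
            // skip lines that are blank in both traces
            if (s1.trim().isEmpty() && s2.trim().isEmpty()) continue;
            length1 += s1.length();
            length2 += s2.length();
            distance += levenshtein(s1, s2);
        }
        return (double) distance / Math.max(length1, length2);
    }
}
```

Because difference(...) returns a continuous value in [0, 1], the caller can keep the score itself (e.g. for clustering) instead of immediately collapsing it to a boolean against a threshold.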

Upvotes: 23

Views: 4201

Answers (4)

Tudor

Reputation: 62439

How about using cosine similarity? This is a general technique to assess similarity between two texts. It works as follows:

Take all the letters from both Strings and build a table like this:

Letter | String1 | String2

This can be a simple hash table or whatever.

In the letter column put each letter and in the string columns put their frequency inside that string (if a letter does not appear in a string the value is 0).

It is called cosine similarity because you interpret each of the two string columns as vectors, where each component is the number associated to a letter. Next, compute the cosine of the "angle" between the vectors as:

C = (V1 * V2) / (|V1| * |V2|)

The numerator is the dot product, that is the sum of the products of the corresponding components, and the denominator is the product of the sizes of the vectors.

The closer C is to 1, the more similar the Strings are.

It may seem complicated but it's just a few lines of code once you understand the idea.

Let's see an example: consider the strings

s1 = aabccdd
s2 = ababcd

The table looks like:

Letter a b c d
s1     2 1 2 2
s2     2 2 1 1

And thus:

C = (V1 * V2) / (|V1| * |V2|) = 
(2 * 2 + 1 * 2 + 2 * 1 + 2 * 1) / (sqrt(13) * sqrt(10)) = 0.877

So they are "pretty" similar.
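The table-plus-dot-product recipe above is only a few lines of Java. A minimal sketch (class and method names are mine for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSimilarity {

    // Build the letter-frequency column for one string.
    static Map<Character, Integer> frequencies(String s) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : s.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }
        return freq;
    }

    // C = (V1 * V2) / (|V1| * |V2|), treating each frequency map as a vector.
    static double cosine(String s1, String s2) {
        Map<Character, Integer> f1 = frequencies(s1);
        Map<Character, Integer> f2 = frequencies(s2);

        double dot = 0, norm1 = 0, norm2 = 0;
        for (int v : f1.values()) norm1 += v * v;
        for (int v : f2.values()) norm2 += v * v;
        // letters absent from f2 contribute 0 to the dot product
        for (Map.Entry<Character, Integer> e : f1.entrySet()) {
            dot += e.getValue() * f2.getOrDefault(e.getKey(), 0);
        }
        return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
    }

    public static void main(String[] args) {
        System.out.printf("%.3f%n", cosine("aabccdd", "ababcd")); // prints 0.877
    }
}
```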

Upvotes: 20

NPE

Reputation: 500347

Since the Levenshtein distance is never greater than the length of the longer string, I'd certainly change the denominator from (length1 + length2) to Math.max(length1, length2). This would normalize the metric to be between zero and one.

Now, it's impossible to answer what's "similar enough" for your needs based on the information provided. I personally try to avoid step functions like you have with the 0.25 cutoff, preferring continuous values from a known interval. Perhaps it would be better to feed the continuous "similarity" (or "distance") values into higher-level algorithms instead of transforming those values into binary ones?

Upvotes: 1

Kai Qing

Reputation: 18833

Here's my take on this - just a long story to consider and not necessarily an answer to your problem:

I've done something similar in the past where I would try to determine if someone was plagiarizing by simply rearranging sentences while maintaining the same sort of message.

1 "children should play while we eat dinner"
2 "while we eat dinner, the children should play"
3 "we should eat children while we play"

So Levenshtein wouldn't be of much use here, because it is linear and each pair would come out considerably different. The standard difference would pass the test and the student would get away with the crime.

So I broke each sentence up into words and recomposed the sentences as arrays, then compared them: first to determine whether a word existed in both arrays, and where it sat relative to the previous match. Each word would then check the next one in the array to find runs of sequential words, as in lines 1 and 2 of my example above. Where there were sequential runs, I composed a string of each sequence common to both arrays and then looked for differences in the remaining words. The fewer remaining words, the more likely they are just filler to make it seem less plagiarized.

"while we eat dinner, I think the children should play"

Then "I think" is evaluated and considered filler based on a keyword lexicon - this part is hard to describe here.

This was a complex project that did a lot more than just what I described and not a simple chunk of code I can easily share, but the idea above is not too hard to replicate.

Good luck. I'm interested in what other SO members have to say about your question.

Upvotes: 2

Garrett Hall

Reputation: 30022

Stack traces are in a format amenable to parsing. I would just parse the stack traces using a parsing library; then you can extract whatever semantic content you want to compare.

Similarity algorithms are going to be slower and difficult to debug when strings don't compare the way you expect.
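For instance, the class-and-method signature of each frame can be pulled out with a regular expression and compared exactly, instead of fuzzily. This is only an illustrative sketch, not a complete grammar for every trace format (it ignores "Caused by:" headers, "... N more" lines, and so on):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FramePattern {

    // Matches a typical "at package.Class.method(File.java:123)" frame line.
    static final Pattern FRAME =
        Pattern.compile("\\s*at\\s+([\\w.$]+)\\.([\\w$<>]+)\\(([^:)]+)(?::(\\d+))?\\)");

    // Returns "fully.qualified.Class.method" for a frame line, or null if the
    // line is not a frame (e.g. the exception header).
    static String signature(String line) {
        Matcher m = FRAME.matcher(line);
        return m.matches() ? m.group(1) + "." + m.group(2) : null;
    }
}
```

Grouping traces by, say, the exception class plus the first few frame signatures sidesteps the threshold question entirely: two traces are in the same group exactly when those extracted pieces are equal.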

Upvotes: 4
