Normalizing the edit distance

I have a question that can we normalize the levenshtein edit distance by dividing the e.d value by the length of the two strings? I am asking this because, if we compare two strings of unequal length, the difference between the lengths of the two will be counted as well. for eg: ed('has a', 'has a ball') = 4 and ed('has a', 'has a ball the is round') = 15. if we increase the length of the string, the edit distance will increase even though they are similar. Therefore, I can not set a value, what a good edit distance value should be.

Upvotes: 28

Answers (3)

Sam Tseng

Reputation: 178

I had used a normalized edit distance or similarity (NES) which I think is very useful, defined by Daniel Lopresti and Jiangyin Zhou, in Equation (6) of their work: http://www.cse.lehigh.edu/~lopresti/Publications/1996/sdair96.pdf.

The NES in python is:

import math
def normalized_edit_similarity(m, d):
    # d : edit distance between the two strings
    # m : length of the shorter string
    return ( 1.0 / math.exp( d / (m - d) ) )

print(normalized_edit_similarity(3, 0))
print(normalized_edit_similarity(3, 1))
print(normalized_edit_similarity(4, 1))
print(normalized_edit_similarity(5, 1))
print(normalized_edit_similarity(5, 2))

1.0
0.6065306597126334
0.7165313105737893
0.7788007830714049
0.513417119032592

More examples can be found in Table 2 in the above paper.

The variable m in the above function can be replaced with the length of the longer string, depending on your application.

Upvotes: 3

Martin

Reputation: 101

I used the following successfully:

len = std::max(s1.length(), s2.length());
// normalize by length, high score wins
fDist = float(len - levenshteinDistance(s1, s2)) / float(len);

Then chose the highest score. 1.0 means an exact match.

Upvotes: 10

Anton

Reputation: 3203

Yes, normalizing the edit distance is one way to put the differences between strings on a single scale from "identical" to "nothing in common".

A few things to consider:

Whether or not the normalized distance is a better measure of similarity between strings depends on the application. If the question is "how likely is this word to be a misspelling of that word?", normalization is a way to go. If it's "how much has this document changed since the last version?", the raw edit distance may be a better option.
If you want the result to be in the range [0, 1], you need to divide the distance by the maximum possible distance between two strings of given lengths. That is, length(str1)+length(str2) for the LCS distance and max(length(str1), length(str2)) for the Levenshtein distance.
The normalized distance is not a metric, as it violates the triangle inequality.

Upvotes: 36

Normalizing the edit distance

Answers (3)

Related Questions