stefan

Reputation: 1569

Efficiently determine "how sorted" a list is, e.g. Levenshtein distance

I'm doing some research on ranking algorithms and would like to calculate a distance between a sorted list and some permutation of that list. For the Levenshtein distance, this corresponds to calculating the distance between a sequence and a sorted copy of that sequence. There is also, for instance, the "inversion distance", for which a linear-time algorithm is detailed here and which I am working on implementing.

Does anyone know of an existing Python implementation of the inversion distance, and/or an optimization of the Levenshtein distance? I'm calculating this on sequences of around 50,000 to 200,000 elements, so O(n^2) is far too slow, but O(n log(n)) or better should be sufficient.

Other metrics for permutation similarity would also be appreciated.
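For reference, here is a minimal sketch of an O(n log(n)) inversion count based on merge sort. It is not the linear-time algorithm linked above, just the textbook approach; the names sort_and_count and inversion_distance are made up for illustration.

```python
def sort_and_count(seq):
    """Return (sorted copy of seq, number of inversions in seq).

    An inversion is a pair of positions (i, j) with i < j and seq[i] > seq[j].
    """
    n = len(seq)
    if n <= 1:
        return list(seq), 0

    mid = n // 2
    left, inv_left = sort_and_count(seq[:mid])
    right, inv_right = sort_and_count(seq[mid:])

    merged = []
    inversions = inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            # right[j] is smaller than every remaining element of left,
            # so it forms an inversion with each of them.
            merged.append(right[j])
            j += 1
            inversions += len(left) - i
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inversions


def inversion_distance(seq):
    """Number of inversions between seq and its sorted order (hypothetical helper)."""
    return sort_and_count(seq)[1]
```

A sorted list gives 0 and a fully reversed list of n distinct items gives n * (n - 1) / 2, so dividing by that maximum yields a score between 0 and 1.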


Edit for people from the future:

Based on Raymond Hettinger's response: it's not Levenshtein or inversion distance, but rather "gestalt pattern matching" :P

from difflib import SequenceMatcher
import random
ratings = [random.gauss(1200, 200) for i in range(100000)]
SequenceMatcher(None, ratings, sorted(ratings)).ratio()

runs in ~6 seconds on a terrible desktop.

Edit2: If you can coerce your sequence into a permutation of [0 .. n-1] (the code below compares each value with its 0-based index), then a variation of the Manhattan metric is extremely fast and has some interesting results.

manhattan = lambda l: sum(abs(a - i) for i, a in enumerate(l)) / (0.5 * len(l) ** 2)
rankings = list(range(100000))
random.shuffle(rankings)
manhattan(rankings) # ~ 0.6665, < 1 second

The normalization factor is technically an approximation: it is exact for even-sized lists, but should be (0.5 * (len(l) ** 2 - 1)) for odd-sized lists.
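A small sketch of the same metric with the exact normalization for both parities, assuming the input is a permutation of 0 .. n-1 as in the snippet above. The maximum of the sum is reached by the reversed list and equals n ** 2 // 2, which covers the even and odd cases at once. The function name normalized_footrule is my own; this distance on permutations is also known as Spearman's footrule.

```python
def normalized_footrule(perm):
    """Sum of |value - index| over perm, scaled to [0, 1].

    Assumes perm is a permutation of 0 .. n-1 (this is not checked).
    """
    n = len(perm)
    if n < 2:
        return 0.0  # a list of 0 or 1 items is trivially sorted
    total = sum(abs(value - index) for index, value in enumerate(perm))
    # The reversed permutation attains the maximum, n**2 // 2,
    # i.e. n**2 / 2 for even n and (n**2 - 1) / 2 for odd n.
    return total / (n * n // 2)
```

With this scaling, a sorted list scores 0.0 and a reversed list scores exactly 1.0 regardless of whether the length is even or odd.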

Edit3: There are several other measures of list similarity, such as the Kendall tau rank correlation coefficient and the Spearman rank correlation coefficient. Implementations are available in the SciPy library as scipy.stats.kendalltau and scipy.stats.spearmanr; each returns the correlation coefficient along with the associated p-value.
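If SciPy is not available, Spearman's coefficient for the tie-free case is easy to sketch by hand. The version below assumes the input is a permutation of 0 .. n-1 compared against sorted order, and uses the standard closed form 1 - 6 * sum(d ** 2) / (n * (n ** 2 - 1)); the function name spearman_rho_vs_sorted is hypothetical. For such input it should agree with the coefficient from scipy.stats.spearmanr, but it does not compute a p-value.

```python
def spearman_rho_vs_sorted(perm):
    """Spearman rank correlation between perm and the identity ordering.

    Assumes perm is a permutation of 0 .. n-1 with no ties (not checked).
    Returns a value in [-1, 1]: 1 for sorted input, -1 for reversed input.
    """
    n = len(perm)
    if n < 2:
        raise ValueError("need at least two elements")
    # d is the difference between each item's rank and its position.
    sum_sq = sum((value - index) ** 2 for index, value in enumerate(perm))
    return 1 - 6 * sum_sq / (n * (n * n - 1))
```

This runs in a single O(n) pass, so it is cheap even for lists of a few hundred thousand elements.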

Upvotes: 15

Views: 2604

Answers (1)

Raymond Hettinger

Reputation: 226316

Levenshtein distance is an O(n**2) algorithm, so if you want to go faster, use the alternative fast algorithm in the difflib module, SequenceMatcher. Its ratio() method computes a measure of similarity between two sequences as a float in the range [0, 1].

If you have to stick with Levenshtein, there is a Python recipe for it on the ASPN Python Cookbook: http://code.activestate.com/recipes/576874-levenshtein-distance/

Another Python script can be found at: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python
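For readers who cannot follow the links, here is a minimal sketch of the classic dynamic-programming Levenshtein distance. It is not the code from either linked recipe, and the function name levenshtein is just for illustration. It keeps only two rows of the table, so memory is O(min(n, m)), but time is still O(n * m), which is why it is impractical for lists of 100,000+ elements.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) between two sequences.

    Works on any indexable sequences whose items support ==,
    e.g. strings or lists of numbers.
    """
    # Keep the shorter sequence as the row to minimise memory.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] = distance between the first i-1 items of a and first j of b.
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty prefix takes i deletions
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]
```

To measure "how sorted" a list is, call it as levenshtein(ratings, sorted(ratings)), keeping the quadratic running time in mind.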

Upvotes: 4
