Philipp123

Reputation: 173

Is there an R function for fast Levenshtein distance computation with a threshold (maxDist)?

I am looking for an R function that returns the Levenshtein distance between two strings when that distance is less than a threshold, and that saves time by not computing distances larger than the threshold. The threshold is given and should be somewhere between 2 and 10. At first I thought a threshold would save a lot of computing time, but I am no longer so sure. I tried amatch from the stringdist package with the maxDist argument, but it does not seem to run any faster than it does without the threshold.

Upvotes: 0

Views: 299

Answers (1)

Samet Sökel

Reputation: 2670

There is a package named RecordLinkage, which includes the levenshteinSim and levenshteinDist functions.

The package is out of date, but it can still be installed from older versions of its source files.

Here is the description of levenshteinSim from the RecordLinkage package manual:

Details
String metrics compute a similarity value in the range [0, 1] for two strings, with 1 denoting the
highest (usually equality) and 0 denoting the lowest degree of similarity. In the context of Record
Linkage, string similarities can improve the discernibility between matches and non-matches.
jarowinkler is an implementation of the algorithm by Jaro and Winkler (see references). For the
meaning of W_1, W_2, W_3 and r see the referenced article. For most applications, the default values
are reasonable.
levenshteinDist returns the Levenshtein distance, which cannot be directly used as a valid string
comparator. levenshteinSim is a similarity function based on the Levenshtein distance, calculated
by $1 - \frac{d(\mathrm{str1}, \mathrm{str2})}{\max(A, B)}$, where $d$ is the Levenshtein distance function and $A$ and $B$ are the lengths of the
strings.
Arguments str1 and str2 are expected to be of type "character".
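
To make the quoted formula concrete without installing anything, here is a minimal sketch that reproduces it with base R's utils::adist, which computes the (generalized) Levenshtein distance. The name lev_sim is made up for this illustration and is not part of RecordLinkage, and the handling of two empty strings is my own assumption.

```r
# Similarity in [0, 1] derived from the Levenshtein distance,
# following the formula quoted above: 1 - d(str1, str2) / max(A, B).
# lev_sim is a hypothetical helper name, not a RecordLinkage function.
lev_sim <- function(str1, str2) {
  # adist() returns a distance matrix; take the single entry for two strings
  d <- utils::adist(str1, str2)[1, 1]
  longest <- max(nchar(str1), nchar(str2))
  # Assumption: two empty strings are treated as identical
  if (longest == 0) return(1)
  1 - d / longest
}

# "kitten" -> "sitting" needs 3 edits and the longer string has 7 characters
lev_sim("kitten", "sitting")
```

For that pair the result is 1 - 3/7, roughly 0.571. Note that this only illustrates the similarity formula; it does not apply any threshold.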

You can install a package from a .tar.gz source file as described here:

How do I install an R package from source?
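
In short, the install step from that link could look roughly like the sketch below. The tarball path is only a placeholder, so substitute the file you actually downloaded; the guard merely avoids an error when the file is missing.

```r
# Placeholder path: replace with the source archive you downloaded
tarball <- "path/to/RecordLinkage_x.y.tar.gz"

if (file.exists(tarball)) {
  # repos = NULL tells R to install from the local file,
  # type = "source" tells it the file is a source package
  install.packages(tarball, repos = NULL, type = "source")
} else {
  message("Source archive not found: ", tarball)
}
```

Building from source may require a compiler toolchain (for example Rtools on Windows) if the package contains compiled code.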

Upvotes: 1
