Lemex
Lemex

Reputation: 3794

Optimizing String Matching Algorithm

function levenshtein(a, b) {
  var i,j,cost,d=[];

  if (a.length == 0) {return b.length;}
  if (b.length == 0) {return a.length;}

  for ( i = 0; i <= a.length; i++) {
    d[i] = new Array();
    d[ i ][0] = i;
  }

  for ( j = 0; j <= b.length; j++) {
    d[ 0 ][j] = j;
  }

  for ( i = 1; i <= a.length; i++) {
    for ( j = 1; j <= b.length; j++) {
      if (a.charAt(i - 1) == b.charAt(j - 1)) {
        cost = 0;
      } else {
        cost = 1;
      }

      d[ i ][j] = Math.min(d[ i - 1 ][j] + 1, d[ i ][j - 1] + 1, d[ i - 1 ][j - 1] + cost);

      if (i > 1 && j > 1 && a.charAt(i - 1) == b.charAt(j - 2) && a.charAt(i - 2) == b.charAt(j - 1)) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + cost)
      }
    }
  }

  return d[ a.length ][b.length];
}

function suggests(suggWord) {
  var sArray = [];
  for(var z = words.length;--z;) {
    if(levenshtein(words[z],suggWord) < 2) { 
      sArray.push(words[z]);
    }   
  }
}

Hello.

I'm using the above implementation of Damerau-Levenshtein algorithm. Its fast enough on a normal PC browser, but on a tablet it takes ~2/3 seconds.

Basically, I'm comparing the word sent to a suggest function to every word in my dictionary, and if the distance is less than 2 adding it to my array.

The dic is an array of words approx size 600,000 (699KB) The aim of this is to make a suggest word feature for my Javascript spell checker.

Any suggestion on how to speed this up? Or a different way of doing this?

Upvotes: 2

Views: 3540

Answers (4)

One thing you can do if you are only looking for distances less than some threshold is to compare the lengths first. For example, if you only want distances less than 2, then the absolute value of the difference of the two strings' lengths must be less than 2 as well. Doing this will often allow you to avoid even doing the more expensive Levenshtein calculation.

The reasoning behind this is that two strings that differ in length by 2, will require at least two insertions (and thus a resulting minimum distance of 2).

You could modify your code as follows:

function suggests(suggWord) {
  var sArray = [];
  for(var z = words.length;--z;) {
    if(Math.abs(suggWord.length - words[z].length) < 2) {
      if (levenshtein(words[z],suggWord) < 2) { 
        sArray.push(words[z]);
      }
    }   
  }
}

I don't do very much javascript, but I think this is how you could do it.

Part of the problem is that you have a large array of dictionary words, and are doing at least some processing for every one of those words. One idea would be to have a separate array for each different word length, and organize your dictionary words into them instead of one big array (or, if you must have the one big array, for alpha lookups or whatever, then use arrays of indexes into that big array). Then, if you have a suggWord that's 5 characters long, you only have to look through the arrays of 4, 5, and 6 letter words. You can then remove the Match.Abs(length-length) test in my code above, because you know you are only looking at the words of the length that could match. This saves you having to do anything with a large chunk of your dictionary words.

Levenshtein is relatively expensive, and more so with longer words. If it is simply the case that Levenshtein is too expensive to do very many times, especially with longer words, you may leverage off another side effect of your threshold of only considering words that either exactly match or that have a distance of 1 (one insertion, deletion, substitution, or transposition). Given that requirement, you can further filter candidates for the Levenshtein calculation by checking that either their first character matches, or their last character matches (unless either word has a length of 1 or 2, in which case Levensthein should be cheap to do). In fact, you could check for a match of either the first n characters or the last n characters, where n = (suggWord.length-1)/2. If they don't pass that test, you can assume that they won't match via Levenshtein. For this you would want primary array of dictionary words ordered alphabetically, and in addition, an array of indexes into that array, but ordered alphabetically by their reversed characters. Then you could do a binary search into both of those arrays, and only have to do Levenshtein calculation on the small subset of words whose n characters of their start or end match the suggWord start or end, and that have a length that differs by at most one character.

Upvotes: 4

user1168577
user1168577

Reputation: 1973

You should store all the words in a trie. This is space efficient when compared to dictionary storing words. And the algorithm to match a word would be to traverse the trie (which marks the end of the word) and get to the word.

Edit

Like I mentioned in my comment. For Levenshtein distance of 0 or 1 you don't need to go through all the words. Two words have Levenshtein distance of 0 if they are equal. Now the problem boils down to predicting all the words which will have Levenshtein distance of 1 for a given word. Let's take an example:

array

For the above word if you want to find Levenshtein distance of 1, the examples will be

  • parray, aprray, arpray, arrpay, arrayp (Insertion of a character)

Here p can be substituted by any other letter.

Also for these words, Levenshtein distance is 1

rray, aray, arry (Deletion of a character)

And finally for these words:

prray, apray, arpay, arrpy and arrap (Substitution of a character)

Here again, p can be substituted with any other letter.

So if you look up for these particular combinations only and not all the words, you will get to your solution. If you know how a Levenshtein algorithm works, we have reverse engineered it.

A final example which is your usecase:

If pary is the word which you get as input and which should be corrected to part from the dictionary. So for pary you don't need to look at words starting with ab for e.g. because for any word starting with ab, Levenshtein distance will be greater than 1.

Upvotes: 2

austincheney
austincheney

Reputation: 1199

There are some simple things you can do in your code to RADICALLY improve execution speed. I completely rewrote your code for performance, static typing compliance with JIT interpretation, and JSLint compliance:

var levenshtein = function (a, b) {
        "use strict";
        var i = 0,
            j = 0,
            cost = 1,
            d = [],
            x = a.length,
            y = b.length,
            ai = "",
            bj = "",
            xx = x + 1,
            yy = y + 1;
        if (x === 0) {
            return y;
        }
        if (y === 0) {
            return x;
        }
        for (i = 0; i < xx; i += 1) {
            d[i] = [];
            d[i][0] = i;
        }
        for (j = 0; j < yy; j += 1) {
            d[0][j] = j;
        }
        for (i = 1; i < xx; i += 1) {
            for (j = 1; j < yy; j += 1) {
                ai = a.charAt(i - 1);
                bj = b.charAt(j - 1);
                if (ai === bj) {
                    cost = 0;
                } else {
                    cost = 1;
                }
                d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
                if (i > 1 && j > 1 && ai === b.charAt(j - 2) && a.charAt(i - 2) === bj) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + cost);
                }
            }
        }
        return d[x][y];
    };

Looking up the length of the array at each interval of a multidimensional lookup is very costly. I also beautified your code using http://prettydiff.com/ so that I could read it in half the time. I also removed some redundant look ups in your arrays. Please let me know if this executes faster for you.

Upvotes: 2

Blacksad
Blacksad

Reputation: 15422

I had to optimize the same algorithm. What worked best for me was to cache the d Array.. you create it with big size (the maximum length of the strings you expect) outside of the levenshtein function, so each time you call the function you don't have to reinitialize it.

In my case, in Ruby, it made a huge difference in performance. But of course it depends on the size of your words array...

function levenshtein(a, b, d) {

var i,j,cost;

if (a.length == 0) {return b.length;}
if (b.length == 0) {return a.length;}

for ( i = 1; i <= a.length; i++) {

    for ( j = 1; j <= b.length; j++) {

        if (a.charAt(i - 1) == b.charAt(j - 1)) {

            cost = 0;

        } else {

            cost = 1;

        }

        d[ i ][j] = Math.min(d[ i - 1 ][j] + 1, d[ i ][j - 1] + 1, d[ i - 1 ][j - 1] + cost);

        if (i > 1 && j > 1 && a.charAt(i - 1) == b.charAt(j - 2) && a.charAt(i - 2) == b.charAt(j - 1)) {

            d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + cost)

        }

    }

}

return d[ a.length ][b.length];

}

function suggests(suggWord)
{
d = [];
for ( i = 0; i <= 999; i++) {

    d[i] = new Array();

    d[ i ][0] = i;

}
for ( j = 0; j <= 999; j++) {

    d[ 0 ][j] = j;

}


var sArray = [];
for(var z = words.length;--z;)
{
        if(levenshtein(words[z],suggWord, d) < 2)
        {sArray.push(words[z]);}    
}
}

Upvotes: 3

Related Questions