Reputation: 38

Faster way match characters between strings than Regex?

The use case is I want to compare a query string of characters to an array of words, and return the matches. A match is when a word contains all the characters in the query string, order doesn't matter, repeated characters are okay. Regex seems like it may be too powerful (a sledgehammer where only a hammer is needed). I've written a solution that compares the characters by looping through them and using indexOf, but it seems consistently slower. (http://jsperf.com/indexof-vs-regex-inside-a-loop/10) Is Regex the fastest option for this type of operation? Are there ways to make my alternate solution faster?

 var query = "word",
    words = ['word', 'wwoorrddss', 'words', 'argument', 'sass', 'sword', 'carp', 'drowns'],
    reStoredMatches = [],
    indexOfMatches = [];

  function match(word, query) {
    var len = word.length,
      charMatches = [],
      charMatch,
      char;

    while (len--) {
      char = word[len];
      charMatch = query.indexOf(char);

      if (charMatch !== -1) {
        charMatches.push(char);
      }
    }
    return charMatches.length === query.length;
  }

  function linearIndexOf(words, query) {
    var wordsLen = words.length,
      wordMatch,
      word;

    while (wordsLen--) {
      word = words[wordsLen];

      wordMatch = match(word, query);

      if (wordMatch) {

        indexOfMatches.push(word);
      }
    }
  }

  function linearRegexStored(words, query) {
    var wordsLen = words.length,
      re = new RegExp('[' + query + ']', 'g'),
      match,
      word;

    while (wordsLen--) {
      word = words[wordsLen];
      match = word.match(re);

      if (match !== null) {

        if (match.length >= query.length) {

          reStoredMatches.push(word);
        }
      }
    }
  }

Upvotes: 1

Answers (3)

Luis Colorado

Reputation: 12668

The regexp match algorithm constructs a finite state automaton and makes its decisions on the current state and character read from left to right. This involves reading each character once and make a decision.

For static strings (to look a fixed string on a couple of text) you have better algorithms, like Knuth-Morris that allow you to go faster than one character at a time, but you must understand that this algorithm is not for matching regular expressions, just plain strings.

if you are interested in Knuth-Morris (there are several other algorithms) just have a round in wikipedia. http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

A good thing you can do is to investigate if you regexp match routines do it with an DFA or a NDFA, as NDFAs occupy less memory and are easier to compute, but DFAs do it faster, but with some compilation penalties and more memory occupied.

Knuth-Morris algorithm also needs to compile the string into an automaton before working, so perhaps it doesn't apply to your problem if you are using it just to find one word in some string.

Upvotes: 0

Alexis Wilke

Reputation: 20720

Note that your regex is wrong, that's most certainly why it goes so fast.

Right now, if your query is "word" (as in your example), the regex is going to be:

/[word]/g

This means look for one of the characters: 'w', 'o', 'r', or 'd'. If one matches, then match() returns true. Done. Definitively a lot faster than the most certainly more correct indexOf(). (i.e. in case of a simple match() call the 'g' flag is ignored since if any one thing matches, the function returns true.)

Also, you mention the idea/concept of any number of characters, I suppose as shown here:

'word', 'wwoorrddss'

The indexOf() will definitively not catch that properly if you really mean "any number" for each and every character. Because you should match an infinite number of cases. Something like this as a regex:

/w+o+r+d+s+/g

That you will certainly have a hard time to write the right code in plain JavaScript rather than use a regex. However, either way, that's going to be somewhat slow.

From the comment below, all the letters of the word are required, in order to do that, you have to have 3! tests (3 factorial) for a 3 letter word:

/(a.*b.*c)|(a.*c.*b)|(b.*a.*c)|(b.*c.*a)|(c.*a.*b)|(c.*b.*a)/

Obviously, a factorial is going to very quickly grow your number of possibilities and blow away your memory in a super long regex (although you can simplify if a word has the same letter multiple times, you do not have to test that letter more than once).

1! = 1
2! = 2
3! = 6
4! = 24
5! = 120
6! = 720
...

That's probably why your properly written test in plain JavaScript is much slower.

Also, in your case you should write the words nearly as done in Scrabble dictionaries: all letters once in alphabetical order (Scrabble keeps duplicates). So the word "word" would be "dorw". And as you shown in your example, the word "wwoorrddss" would be "dorsw". You can have some backend tool to generate your table of words (so you still write them as "word" and "words", and your tool massage those and convert them to "dorw" and "dorsw".) Then you can sort the letters of the words you are testing in alphabetical order and the result is that you do not need a silly factorial for the regex, you can simply do this:

/d.*o.*r.*w/

And that will match any word that includes the word "word" such as "password".

One easy way to sort the letters will be to split your word in an array of letters, and then sort the array. You may still get duplicates, it will depend on the sort capabilities. (I don't think that the default JavaScript sort will remove duplicates automatically.)

One more detail, if you test is supposed to be case insensitive, then you want to transform your strings to lowercase before running the test. So something like:

query = query.toLowerCase();

early on in your top function.

Upvotes: 3

Victory

Reputation: 5890

You are trying to speed up the algorithm "chars in word are a subset of the chars of query." You can short circuit this check and avoid some assignments (that are more readable but not strictly needed). Try the following version of match

  function match(word, query) {
    var len = word.length;
    while (len--) {
      if (query.indexOf(word[len]) === -1) { // found a missing char
        return false;
      }
    }
    return true; // couldn't find any missing chars
  }

This gives a 4-5X improvement

Depending on the application you could try presorting words and presorting each word in words as another optimization.

Upvotes: 0

Faster way match characters between strings than Regex?

Answers (3)

Related Questions