parxier
parxier

Reputation: 3871

JavaScript regex exec takes too long to execute

I've got a simple JavaScript regex check (written by other developer) that works perfectly on thousands of different strings. However I've just discovered one particular string value that's causing that regex to take as long as 10min to execute in Firefox/IE which is unacceptable. I've extracted an actual regex call into small code snippet for your convenience:

<html>
  <script>
    function dodo(){
      var mask = /^([\w'#@\-\&\(\)\/.]+[ ]*){1,100}$/;
      var value = "Optometrists Association Australia, Queensland/NT Division";
      mask.exec(value);
    }
  </script>
  <body>
    <input type="button" value="Click" onclick="dodo()">
  </body>
</html>

What is the problem here? If I change value to anything else it works perfectly.

Thank you!

Upvotes: 3

Views: 2815

Answers (4)

Andy Ross
Andy Ross

Reputation: 12043

This looks like a poor application for a regex, and a poor regex to boot. The intent seems to be to match a list of between 1 and 100 space-separated "words", I think. Here are the core problems I can see:

  1. The use of "[ ]*" at the end of the word, instead of "[ ]+" means that every byte can potentially be a "word" alone, whether it's bounded by spaces or not. That's a lot of match cases for your engine to keep track of.

  2. You're using capturing parentheses ("(...)") instead non-capturing ones ("(?:...)"). The grouping will be doing yet more bookeeping to save the last word matched for you, which you probably doing need or not.

And some minor issues:

  1. The "[ ]*" expression is redundant. Just use " *" to match zero or more spaces. But you probably want "\s" there, to match whitespace of any type, not just a space.

  2. The expression allows whitespace at the end of the string, but not the beginning. Most applications usually want to tolerate both or neither.

  3. For readability, don't use backslash escaping where it's not needed. Only the "-" in your bracket actually needs it.

  4. What's magic about 100? Do you really want to hard-code that limit?

Finally, why use a regex here at all? Why not simply split() on whitespace into an array of substrings, and then test each resulting word against a simpler expression?

Upvotes: 4

ojrac
ojrac

Reputation: 13421

You're running in to crazy backtracking, a common feature in regexes that includes something of the form ([characters]+)+ -- it runs great for all sorts of matching patterns, but then you find a string like this, which makes it explode, recursing all over the string. Here's a sketch of what happens.

To start out, your pattern splits the string into groups. I use a | to start instances of your group, the one you repeat {1,100}. > is the end of a group, and ? is the regex parser's "cursor."

|----------->|---------->|-------?
Optometrists Association Australia, Queensland/NT Division

At the ?, your pattern can't match any more symbols or whitespace, so it tries to match $. Since the cursor hasn't reached the end of the line yet, it fails, and the regex parser backtracks:

|----------->|---------->|------?
Optometrists Association Australia, Queensland/NT Division

Once again, it can't find any whitespace, so it terminates the group, and tries to start another one (since there can be up to 100, and we've only used 3 so far).

|----------->|---------->|------|-?
Optometrists Association Australia, Queensland/NT Division

The parser has reached the problematic , again, and it kills that execution tree, causing it to backtrack once more, to the i in Australia. And, just like last time, it tries to start a group:

|----------->|---------->|-----|--?
Optometrists Association Australia, Queensland/NT Division

...anyway, you get the idea. This cycle of fail, backtrack and slice again will effectively freeze up your Regex parser until it's exhausted every permutation and returned false. The key to recognizing and fixing this is to never repeat a repeated group without some form of delimiter at the beginning and/or end. I'd suggest using the word boundary anchor \b, since [ ]+ would require your strings to end in whitespace:

/^(\b[\w'#@\-\&\(\)\/.]+\b[ ]*){1,100}$/

As a side-note, it's hard to tell what your regex is doing without more context, but it seems like you could also just call value.split(' ') to split the string on whitespace characters, and run a simpler regex on all those substrings. It'd eliminate the need for the double regex repeat.

Upvotes: 7

Oren Trutner
Oren Trutner

Reputation: 24208

You probably meant to have a + after the space group, rather than *. If you replace it back with a +, things go much faster. The * causes the regex evaluator to try a huge number of combinations, all of which fail when they reach the ','. You might want to add a ',' to the first character group too.

Overall, it might look like this:

var mask = /^([\w'#@\-\&\(\)\/.,]+[ ]+){1,100}$/;

Upvotes: 6

Luke Schafer
Luke Schafer

Reputation: 9265

removing the comma from the string or adding it to the character group makes it execute quickly, but without examples of correct operation or an explanation of what you're trying to achieve I can't say for sure if it works correctly...

Upvotes: 1

Related Questions