szydan
szydan

Reputation: 2596

Could you explain why these regex is slow

Recently I found following regex in the code. As the string it was checking was quite large it was freezing the browser. Running some experiments I measured the times using following code

var s = 'Lorem ipsum dolor sit amet, rhoncus nam sem feugiat, vel vel, viverra ultrices interdum. Volutpat ac congue. Lacinia sit donec quis facilisi, magna cubilia volutpat lectus fusce ligula quis, sit sed vivamus eget mauris quisque, aenean aenean nec litora litora massa, malesuada turpis pretium. Magnis metus nulla mauris dictum ligula, odio facilisis nullam laoreet. Aliquam tincidunt enim sit dolor mi. Duis malesuada pede, tortor consectetuer facilisis massa et leo vel. Eget fames tellus mi. Suscipit tincidunt fusce lacus convallis, ornare eu sed eu gravida interdum. Vivamus ipsum, maecenas penatibus, lacus posuere, eu cum, mauris ea libero elit. Libero blandit mattis mi sapien, iaculis wisi sit convallis, est in libero, elementum cras in a cum a vestibulum';

for (var i = 0; i < 3; i++) {
  s += s;
}

start = new Date().getTime();
s.match(/AAA/i)
stop = new Date().getTime();
console.log("AAA took " + (stop - start) + " ms")

start = new Date().getTime();
s.match(/BBB/i)
stop = new Date().getTime();
console.log("BBB took " + (stop - start) + " ms")

start = new Date().getTime();
s.match(/CCC/i)
stop = new Date().getTime();
console.log("CCC took " + (stop - start) + " ms")

start = new Date().getTime();
s.match(/.*(AAA|BBB|CCC).*/i)
stop = new Date().getTime();
console.log("Combined took " + (stop - start) + " ms")

The above print

AAA took 0 ms
BBB took 1 ms
CCC took 0 ms
Combined took 53 ms

Could you explain in plain words why this regex is so slow, while checking for individual parts takes almost no time? Is there another way of writing a one line regex to check for occurrence of multiple strings which would produce the results faster?

Upvotes: 2

Views: 448

Answers (2)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626952

The /AAA/i, /BBB/i, /CCC/i differ from /.*(AAA|BBB|CCC).*/i not only in the use of alternation capturing group, but also in .* patterns.

The /.*(AAA|BBB|CCC).*/i pattern slows down matching due to the first .* that is a greedy dot pattern. It works in the following way:

  • It grabs all chars it can match (0 or more chsrs other than line break chars), so, basically, it grabs the whole line
  • Then it must match either AAA or BBB or CCC, thus backtracking starts: the engine yields a char from the end of the match and tries to match one of the alternatives
  • If the alternatives are missing, or if they are at the start of a very long string, the engine will work in vain trying to accommodate a part of a string for the capturing group match.

See this regex debugger with visualized backtracking steps.

So, the best way to find if a line / string contains any of the 3 alternatives is just to use

/(?:AAA|BBB|CCC)/i

since this regex can find partial matches (it does not require the full string match as in Java's String#match()).

If you need to find lines in a multiline string having one of the alternatives, it is a good idea to process line by line (say, split with \n) and then .filter(x => /(?:AAA|BBB|CCC)/i.test(x)).

Note (?:...) is a non-capturing group that does not create submatches, and there is less overhead since the engine does not have to allocate additional memory for the captures.

Upvotes: 1

Zenoo
Zenoo

Reputation: 12880

You didn't include the .* in your previous tests, resulting in your match returning false very quickly.

Here, you can see each test takes about 30ms to run.

The global still takes only 40ms.

Beyond all that, using s.match(/AAA|BBB|CCC/i) will be way faster.

var s = 'Lorem ipsum dolor sit amet, rhoncus nam sem feugiat, vel vel, viverra ultrices interdum. Volutpat ac congue. Lacinia sit donec quis facilisi, magna cubilia volutpat lectus fusce ligula quis, sit sed vivamus eget mauris quisque, aenean aenean nec litora litora massa, malesuada turpis pretium. Magnis metus nulla mauris dictum ligula, odio facilisis nullam laoreet. Aliquam tincidunt enim sit dolor mi. Duis malesuada pede, tortor consectetuer facilisis massa et leo vel. Eget fames tellus mi. Suscipit tincidunt fusce lacus convallis, ornare eu sed eu gravida interdum. Vivamus ipsum, maecenas penatibus, lacus posuere, eu cum, mauris ea libero elit. Libero blandit mattis mi sapien, iaculis wisi sit convallis, est in libero, elementum cras in a cum a vestibulum'

for (var i = 0; i < 3; i++) {
  s += s;
}

start = new Date().getTime();
s.match(/.*AAA.*/i)
stop = new Date().getTime();
console.log("AAA took " + (stop - start) + " ms")

start = new Date().getTime();
s.match(/.*BBB.*/i)
stop = new Date().getTime();
console.log("BBB took " + (stop - start) + " ms")

start = new Date().getTime();
s.match(/.*CCC.*/i)
stop = new Date().getTime();
console.log("CCC took " + (stop - start) + " ms")

start = new Date().getTime();
s.match(/.*(AAA|BBB|CCC).*/i)
stop = new Date().getTime();
console.log("Combined took " + (stop - start) + " ms")

Upvotes: 0

Related Questions