Reputation: 11506
I want to match all block and multiline comments in a JavaScript file (these are C-Style comments). I have a pattern that works well. However, it creates some backtracking which slows it down significantly, especially on larger files.
Pattern: \/\*(?:.|[\r\n])*?\*\/|(?:\/\/.*)
Example: https://www.regex101.com/r/pR6eH6/2
How can I avoid the backtracking?
Upvotes: 3
Views: 191
Reputation: 626802
You have heavy backtracking because of the alternation. Instead of the (?:.|[\r\n])
, you may consider using a character class [\s\S]
that boosts performance to a noticeable extent:
\/\*[\s\S]*?\*\/|\/\/.*
See demo
In Python, you can use the re.S
/re.DOTALL
modifier to make .
match line breaks, too (note that the single line comment pattern should be matched with \/\/[^\r\n]*
then):
/\*.*?\*/|//[^\r\n]*
See another demo
However, since *?
lazy quantifier will also cause an overhead similar to the one caused by greedy quantifiers, you should consider using a much more optimal pattern for C style multiline comments - /\*[^*]*\*+(?:[^/*][^*]*\*+)*/
, and the whole regex will now look like:
/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//.*
See yet another demo
Details:
/\*
- a /*
[^*]*
- zero or more chars other than *
\*+
- one or more asterisks(?:[^/*][^*]*\*+)*
- zero or more sequences of:
[^/*]
- a symbol other than /
and *
[^*]*
- zero or more symbols other than *
\*+
- 1+ asterisks/
- a /
symbol|
- or //.*
- //
and any 0+ chars other than than line break chars.Just wanted to note that in Python, you do not need to escape /
(in JS, you do not need to escape /
when declaring a regex using the RegExp constuctor).
NOTE: The last pattern does not allow simple capturing what is inside /*
and */
, but since the pattern is more stable than the rest, I'd advise using it even when you need to capture the contents with the trailing *
- /\*([^*]*\*+(?:[^/*][^*]*\*+)*)/|//(.*)
- and then you'd need to remove the last char from .group(1)
.
Upvotes: 7
Reputation: 89557
What can you do with your pattern?
Your actual pattern is:
\/\*(?:.|[\r\n])*?\*\/|(?:\/\/.*)
or without useless backslashes and groups:
/\*(?:.|[\r\n])*?\*/|//.*
As stribizhev has explained (?:.|[^\r\n])*?
can be written in a more simple way using the DOTALL mode, i.e.: .*?
or without using [\s\S]
in place of the dot.
But you can do much better if you put in factor the first character /
that is in common for the two branches of your main alternation (the branch for multiline comments and the branch for singleline comments):
/(?:\*[\s\S]*?\*/|/.*)
The two advantages of this change:
Beginning a pattern with an alternation is not a good idea and must be avoided when possible, because the regex engine must test the two branches of the alternation (in the worst case) for each position in the string. So in your case (only two branches), you can consider that the regex engine work is X2.
If you put the first character (or more tokens if possible) in factor, the greatest part of uninteresting positions in the string are more quickly discarded (positions that doesn't start with a /
), since there is only one branch to test when the first character is not the good one.
When you start a pattern with a literal string, the regex engine is able to use a faster algorithm to directly find positions in the string where the pattern may succeed (the positions where the literal string appears). In your case, using this optimisation will make your pattern much faster.
Other thing you can improve: the non-greedy quantifier
A non-greedy quantifier is slow by nature (compared to a greedy quantifier) because each time it take a character, it must test if the end of the pattern succeeds or not (until the end of the pattern succeeds). In other words, a non-greedy quantifier can be worst than a greedy quantifier when the backtracking mechanism occurs (the backtracking mechanism and how quantifiers work is one of the more (the most?) important thing to understand, take the time for that).
You can rewrite the subpattern \*[\s\S]*?\*/
in a more efficient way:
\*[^*]*\*+(?:[^*/][^*]*\*+)*/
details:
\* # literal asterisk
[^*]* # zero or more character that are not an asterisk
\*+ # one or more asterisks: this one will match either the last asterisk(s)
# before the closing slash or asterisk(s) inside the comment.
(?:[^*/][^*]*\*+)* # In case there are asterisks(s) inside the comment, this
# optional group ensures the next character isn't a slash: [^*/]
# and reach the next asterisk(s): [^*]*\*+
/ # a literal slash
This subpattern is more longer but more efficient since it uses only greedy quantifiers, and has backtracking steps reduced to the minimum.
The pattern is now:
/(?:\*[^*]*\*+(?:[^*/][^*]*\*+)*/|/.*)
and only needs ~950 steps (instead of ~12500) to find the 63 occurrences of your example string.
Upvotes: 1