Parse C-Style Comments with Regex, avoid Backtracking

Question

I want to match all block and multiline comments in a JavaScript file (these are C-Style comments). I have a pattern that works well. However, it creates some backtracking which slows it down significantly, especially on larger files.

Pattern: /\*(?:.|[ ])*?\*/|(?://.*)

Example: https://www.regex101.com/r/pR6eH6/2

How can I avoid the backtracking?

Wiktor Stribiżew · Accepted Answer

You have heavy backtracking because of the alternation. Instead of the (?:.|[ ]), you may consider using a character class [\s\S] that boosts performance to a noticeable extent:

/\*[\s\S]*?\*/|//.*

See demo

In Python, you can use the re.S/re.DOTALL modifier to make . match line breaks, too (note that the single line comment pattern should be matched with //[^ ]* then):

/\*.*?\*/|//[^
]*

See another demo

However, since *? lazy quantifier will also cause an overhead similar to the one caused by greedy quantifiers, you should consider using a much more optimal pattern for C style multiline comments - /\*[^*]*\*+(?:[^/*][^*]*\*+)*/, and the whole regex will now look like:

/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//.*

See yet another demo

Details:

/\* - a /*
[^*]* - zero or more chars other than *
\*+ - one or more asterisks
(?:[^/*][^*]*\*+)* - zero or more sequences of:
- [^/*] - a symbol other than / and *
- [^*]* - zero or more symbols other than *
- \*+ - 1+ asterisks
/ - a / symbol
| - or
//.* - // and any 0+ chars other than than line break chars.

Just wanted to note that in Python, you do not need to escape / (in JS, you do not need to escape / when declaring a regex using the RegExp constuctor).

NOTE: The last pattern does not allow simple capturing what is inside /* and */, but since the pattern is more stable than the rest, I'd advise using it even when you need to capture the contents with the trailing * - /\*([^*]*\*+(?:[^/*][^*]*\*+)*)/|//(.*) - and then you'd need to remove the last char from .group(1).

Parse C-Style Comments with Regex, avoid Backtracking

Answers (2)

Related Questions