Simplifying Regular Expression Quantifiers

Question

I'm currently working on a library that simplifies creating regular expression patterns.

To generate the most legible patterns, I'd like to simplify quantifiers where possible. Assume the following subpattern has been emitted:

(?:\d+)?

The above can be simplified to \d*, but is it also correct to assume that (?:ℝ+)? can always be simplified to ℝ*, where ℝ is an arbitrary (parenthesized, if necessary) regular expression?

If yes, the same should hold true for the following, right?

(?:ℝ+)? => ℝ*
(?:ℝ+)* => ℝ*
(?:ℝ+)+ => ℝ+
(?:ℝ*)? => ℝ*
(?:ℝ*)* => ℝ*
(?:ℝ*)+ => ℝ*
(?:ℝ?)? => ℝ?
(?:ℝ?)+ => ℝ*
(?:ℝ?)* => ℝ*

Bergi · Accepted Answer

Yes, you're correct: each nested form matches exactly the same language as its simplified counterpart, so the two differ only in execution behavior. You should always emit the simpler one, because nested repetitions are prone to catastrophic backtracking.
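As a rough sanity check rather than a proof, the rewrite table from the question can be encoded as a lookup and compared against the nested forms by brute force. The sketch below uses Python's `re` module; the names `COLLAPSE`, `nested`, `collapsed`, `same_language` and the sample subpatterns are made up for illustration and are not part of any library. It only tests short strings over a two-letter alphabet, so it can catch a wrong rule but cannot establish equivalence in general.

```python
import itertools
import re

# (inner, outer) quantifier pair -> single quantifier that replaces it,
# i.e. (?:R<inner>)<outer> => R<result>, copied from the table in the question.
COLLAPSE = {
    ("+", "?"): "*",
    ("+", "*"): "*",
    ("+", "+"): "+",
    ("*", "?"): "*",
    ("*", "*"): "*",
    ("*", "+"): "*",
    ("?", "?"): "?",
    ("?", "+"): "*",
    ("?", "*"): "*",
}


def nested(r, inner, outer):
    """Build the unsimplified pattern; r must already be a single atom or group."""
    return f"(?:{r}{inner}){outer}"


def collapsed(r, inner, outer):
    """Build the simplified pattern according to the lookup table."""
    return f"{r}{COLLAPSE[(inner, outer)]}"


def same_language(p1, p2, alphabet="ab", max_len=5):
    """Check that both patterns accept the same strings up to max_len characters."""
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(re.fullmatch(p1, s)) != bool(re.fullmatch(p2, s)):
                return False
    return True


# Hypothetical sample subpatterns standing in for the arbitrary expression R.
SAMPLES = ["a", "(?:ab)", "(?:a|bb)", "[ab]"]

results = {
    (r, inner, outer): same_language(nested(r, inner, outer),
                                     collapsed(r, inner, outer))
    for r in SAMPLES
    for (inner, outer) in COLLAPSE
}

# Any pair listed here would be a counterexample to one of the rules.
failures = [key for key, ok in results.items() if not ok]
print(failures)
```

Only full matches are compared here. If the subpattern contains capturing groups, the captured text may differ between the nested and simplified forms even though the matched language is the same, so a library that exposes captures may want to test that separately.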
