Reputation: 16470
I'm currently working on a library that simplifies creating regular expression patterns.
To generate the most legible patterns, I'd like to simplify quantifiers where possible. Assume the following subpattern has been emitted:
(?:\d+)?
The above can be simplified to \d*
, but is it also correct to assume that (?:ℝ+)?
can always be simplified to ℝ*
, where ℝ
is an arbitrary (parenthesized, if necessary) regular expression?
If yes, the same should hold true for the following, right?
(?:ℝ+)?
=> ℝ*
(?:ℝ+)*
=> ℝ*
(?:ℝ+)+
=> ℝ+
(?:ℝ*)?
=> ℝ*
(?:ℝ*)*
=> ℝ*
(?:ℝ*)+
=> ℝ*
(?:ℝ?)?
=> ℝ?
(?:ℝ?)+
=> ℝ*
(?:ℝ?)*
=> ℝ*
Upvotes: 2
Views: 106
Reputation: 362137
I'd suggest converting the quantifiers into (m, n) repetition counts:
*
is (0, ∞),+
is (1, ∞), and?
is (0, 1).Then work out the rules for combining quantifiers. In general, what is (ℝ{m1, n1}){m2, n2}
? Off the top of my head, I'd guess you could simplify that to ℝ{m1·m2, n1·n2}
if neither m1 nor m2 is greater than 1.
Upvotes: 3
Reputation: 665517
Yes, you're correct. And you should always use the smaller one, as the nested repetitions are prone to catastrophic backtracking. So apart from different execution behavior, they will match the same languages.
Upvotes: 1