sqd
sqd

Reputation: 1535

Why can "a*a+" and "(a{2,3})*a{2,3}" match "aaaa" while "(a{2,3})*" cannot?

My understanding of * is that it consumes as many characters as possible (greedily) but "gives back" when necessary. Therefore, in a*a+, a* would give one (or maybe more?) character back to a+ so it can match.

However, in (a{2,3})*, why doesn't the first "instance" of a{2,3} gives a character to the second "instance" so the second one can match?

Also, in (a{2,3})*a{2,3} the first part does seem to give a character to the second part.

Upvotes: 4

Views: 264

Answers (3)

ipid
ipid

Reputation: 652

A simple workaround for your question is to match aaaa with regex ^(a{2,3})*$.

Your problem is that:

In the case of (a{2,3})*, regex doesn't seem to consume as much character as possible.

I suggest not to think in giving back characters. Instead, the key is acceptance.

Once regex accept your string, the matching will be over. The pattern a{2,3} only matches aa or aaa. So in the case of matching aaaa with (a{2,3})*, the greedy engine would match aaa. And then, it can't match more a{2,3} because there is only one a remained. Though it's able for regex engine to do backtrack and match an extra a{2,3}, it wouldn't. aaa is now accepted by the regex, thus regex engine would not do expensive backtracking.

If you add an $ to the end of the regex, it simply tells regex engine that a partly match is unacceptable. Moreover, it's easy to explain the (a{2,3})*a{2,3} case with accepting and backtracking.

Upvotes: 2

Nahuel Fouilleul
Nahuel Fouilleul

Reputation: 19315

backtracking is done only if match fails however aaa is a valid match, a negative lookahead (?!a) can be use to prevent the match be followed by a a.

compare

(aaa?)*

and

(aaa?)*(?!a)

Upvotes: 0

slebetman
slebetman

Reputation: 113866

The main problem is this:

My understanding of * is that it consumes as many characters as possible (greedily) but "gives back" when necessary

This is completely wrong. It is not what greedy means.

Greedy simply means "use the longest possible match". It does not give anything back.

Once you interpret the expressions with this new understanding everything makes sense.


  1. a*a+ - zero or more a followed by one or more a

  2. (a{2,3})*a{2,3} - zero or more of either two or three a followed by either two or three a (note: the KEY THING to remember is "zero or more", the first part not matching any character is considered a match)

  3. (a{2,3})* - zero or more of either two or three a (this means that after matching three as the last single a left cannot match)

Upvotes: 1

Related Questions