user3062572
user3062572

Reputation: 126

Can someone explain these regex results

I was testing out some random regex and came across some weird results. Say we have the regular expression (ab|(ba)*|a)* It does not match aba but if I remove the inner star, (ab|(ba)|a)* or if I switch the ordering of the terms, (a|ab|(ba)*)* these two cases now match aba. So why is this the case? Is it something to do with ambiguity or the nested *? I know its a weird test case and the inner * is redundant but I just want to understand these results. I was using regex101.com to test.

Upvotes: 2

Views: 76

Answers (2)

Shashank
Shashank

Reputation: 13869

The alternation operator (|) is short-circuiting and will always try to match the left-most possible subpattern until that one fails, at which time it will attempt to match the next one. Only non-overlapping patterns can be matched. An empty-string match causes the current greedy pattern to end, because empty strings can be matched infinitely, and it doesn't make sense to keep doing so, greedy or not. Greedy does not necessarily mean stupid. :)

So in the case of the pattern (ab|(ba)*|a)*, and the string 'aba', it will match 'ab' from the beginning of the string. Since you're using a greedy quantifier on the outermost capture group, *, the regex will continue trying to make a longer match with the outermost capture group. The match iterator will be at the 3rd character, and it will try to match 'ab', but it will fail. Then, upon realizing that it can potentially match (ba)* an infinite amount of times with the empty string, it will end the match (without capturing anything with (ba)* and without attempting to match the last alternative pattern, a) and return the last iteration of the outermost repeated capturing group.

Now if you switch the ordering of the subpatterns linked with the alternation operator like (ab|a|(ba)*)*, that will match the whole string, since the matcher is able to advance the match iterator with a, and then completes the match with a final empty-string match of the 3rd alternative subpattern.

(ab|(ba)|a)* also works because the second alternative can't be matched with the empty string, so as soon as it fails to match ba, it successfully moves on to attempt to match a.

Another similar way to fix it would be to use (ab|(ba)+|a)*. This will correctly cause the second alternative to fail properly instead of matching it.

A final way to fix it is to use the anchor to the end of the string, commonly represented by $. The pattern (ab|(ba)*|a)*$ is able to correctly fail on matching the second alternative, by realizing that it will never reach the end of the string by doing so. It will still match the second alternative eventually, but only after the match iterator has traversed to the end of the string.

That's why you see only one capture with the string 'aba' from your outermost capture group. The pattern (ba)* will always match from index 2-2 (or any empty substring for that matter), which then ends the current match and prevents the next a from matching, but will not capture anything unless you have an explicit 'ba' in your string that doesn't overlap with any earlier alternatives.

Upvotes: 3

willeM_ Van Onsem
willeM_ Van Onsem

Reputation: 477180

Your assumption is false: it matches aba, see here.

The point is that there is a difference in "what the regex" prefers to match. If you however force the regex to match from start-to-end, it will match aba completely.

Some more detail: if you use the disjunction pattern (for instance r|s with r and s other regexes): the regex "likes" to select the left regex r over the right regex s. For instance if the regex says (a|aa)* and the input is aa, one can match the first item twice, or the use the second one. In that case, the regex likes to select the first item twice.

The same holds for repetitions, a regex wants to repeat the item within the Kleene star r* as much as possible.

Upvotes: 2

Related Questions