Thomas

Reputation: 480

Why does this regex match as few characters as possible even without a question mark?

I'm using this expression:

(.*(?:<br\/>(?:<\/p>)?\n.*)+)

On this example text:

test text <br/>
test line2 <br/>
test line3 <br/>
test line4

Instead of giving me this as one complete match, it is split into two matches (when using the g flag; otherwise only the first match is returned):

MATCH ONE: 
test text <br/>
test line2 <br/>

MATCH TWO:
test line3 <br/>
test line4

(Link to example: https://www.regex101.com/r/lS9vV7/3)

Edit: This expression should match the whole string instead of splitting it into two matches.

Upvotes: 2

Views: 80

Answers (2)

nhahtdh

Reputation: 56819

Instead of content (br \n content)+, change it to (content br \n)+ content:

(?:.*<br\/>(?:<\/p>)?\n)+.*

Demo on regex101

The original regex and the solution above have equivalent matching power, i.e. if you anchor the regex, the two solutions match the same language (the set of strings that satisfy the grammar defined by the regular expression). However, due to the backtracking mechanism and the order in which a backtracking engine explores the search tree, the results differ.
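The difference is easy to reproduce. Below is a minimal sketch using Python's `re` module (an assumption, since the question does not name a flavor; the variable names are only illustrative), with `finditer` standing in for the g flag:

```python
import re

# Sample text from the question.
text = (
    "test text <br/>\n"
    "test line2 <br/>\n"
    "test line3 <br/>\n"
    "test line4"
)

# content (br \n content)+  -- the pattern from the question
original = r"(.*(?:<br\/>(?:<\/p>)?\n.*)+)"
# (content br \n)+ content  -- the rearranged pattern
rewritten = r"(?:.*<br\/>(?:<\/p>)?\n)+.*"

original_matches = [m.group(0) for m in re.finditer(original, text)]
rewritten_matches = [m.group(0) for m in re.finditer(rewritten, text)]

print(len(original_matches))   # 2: the text is split after line 2
print(len(rewritten_matches))  # 1: the whole text is one match
```

The first match of the original pattern ends with the `<br/>` of line 2, which is exactly the text the trailing `.*` swallowed.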

After a greedy quantifier (e.g. *, +, {n,}, {n,m}) satisfies its lower bound of repetition, it tries to match the atom as many times as possible. When the atom fails to match again, the quantifier stops repeating and the engine continues to the sequel pattern. While the engine can backtrack into the atom, and also undo repetitions, backtracking only occurs when the sequel pattern fails. In our case, there is no sequel pattern (in other words, the match is accepted right away).

As analyzed in the other answer, the second .* in (.*(?:<br\/>(?:<\/p>)?\n.*)+) can match <br/>, which means that there is no <br/> left for the next repetition. Due to the backtracking mechanism described above, the quantifier + stops trying for more, and the match is accepted (since there is no sequel pattern).

(As an example of a sequel: when you add the anchor \z at the end, \z is the sequel, which prevents the match from ending in the middle of the input string.)
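A small sketch of the anchored case, assuming Python's `re` module, where the end-of-input anchor is spelled `\Z` (other flavors use `\z`). With the anchor as a sequel, the original pattern is forced to backtrack and ends up matching the whole text:

```python
import re

# Sample text from the question.
text = (
    "test text <br/>\n"
    "test line2 <br/>\n"
    "test line3 <br/>\n"
    "test line4"
)

# Original pattern from the question, followed by an end-of-input anchor.
anchored = r"(.*(?:<br\/>(?:<\/p>)?\n.*)+)\Z"

m = re.search(anchored, text)
matched_everything = m is not None and m.group(0) == text

print(matched_everything)
```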

In my solution, in order for the outer repetition to stop, the pattern .*<br\/>(?:<\/p>)?\n has to fail, which means that the engine has to try all possibilities by backtracking. This forces .* to backtrack until <br/> can match at the end of a line.

Upvotes: 2

m.cekiera

Reputation: 5395

Try with:

(.*(?:<br\/>(?:<\/p>)?\n[^<]*)+)

DEMO

I think your regex didn't work because the .* after \n also matched the next <br/> part, so the group (?:<br\/>(?:<\/p>)?\n.*)+ couldn't repeat. If you replace .* with [^<]*, it will not match <br/> and it should work as you intended (at least I hope so).
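A quick check of this variant, sketched with Python's `re` module (an assumption about the flavor; the second input string is a made-up example). One thing to keep in mind is that `[^<]` also matches newlines, so the match can run across later lines that contain no `<br/>` at all:

```python
import re

# Sample text from the question.
text = (
    "test text <br/>\n"
    "test line2 <br/>\n"
    "test line3 <br/>\n"
    "test line4"
)

pattern = r"(.*(?:<br\/>(?:<\/p>)?\n[^<]*)+)"

matches = [m.group(0) for m in re.finditer(pattern, text)]
print(len(matches))  # 1: the whole text is matched at once

# Hypothetical input: only the first line ends with <br/>.
other = "a <br/>\nb\nc"
caveat = re.search(pattern, other).group(0)
print(repr(caveat))  # includes "c" as well, because [^<]* crosses the newline
```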

Upvotes: 1
