Hooch

Reputation: 29683

Regex to match if given text is not found and match as little as possible

I have this text:

<a>
sdfsdf
<b>DDzz</b>
sdfsdf
</a>
<a>
sdfsdf
<b>DDzz</b>
sdfsdf
</a>
<a>
sdfsdf
<b>BBzz</b>
sdfsdf
</a>
<a>
sdfsdf
<b>DDzz</b>
sdfsdf
</a>

I can't parse it as XML; I have to use a regex here. Also, this is only an example.

I want a regex that matches every <a>...</a> group that does not contain a b element whose text starts with BB.

I came up with this regex: <a>.*?<b>(?!B).*?</b>.*?</a> But it matches the last group as:

<a>
sdfsdf
<b>BBzz</b>
sdfsdf
</a>
<a>
sdfsdf
<b>DDzz</b>
sdfsdf
</a>

That is not what I want, because the match includes the group with BBzz.

How can I write a regex that matches only these 3 groups from my example?

1.

<a>
sdfsdf
<b>DDzz</b>
sdfsdf
</a>

2.

<a>
sdfsdf
<b>DDzz</b>
sdfsdf
</a>

3.

<a>
sdfsdf
<b>DDzz</b>
sdfsdf
</a>

Upvotes: 1

Views: 83

Answers (1)

Wiktor Stribiżew

Reputation: 627126

Use a tempered greedy token regex:

<a>(?:(?!<(?:b>BB|/?a>)).)*</a>

Enable the "dot matches newline" option (often called DOTALL or single-line mode) so that . also matches line breaks.

Details:

  • <a> - a literal <a> char sequence
  • (?:(?!<(?:b>BB|/?a>)).)* - a tempered greedy token: it matches any char (.), zero or more times, as long as that char does not start a sequence matched by the (?!<(?:b>BB|/?a>)) lookahead, i.e. <b>BB, </a> or <a>
  • </a> - a literal </a> char sequence
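
To see the pattern at work, here is a minimal sketch in Python. The question does not name a language, so Python and its re.DOTALL flag (Python's "dot matches newline" option) are assumptions, and the variable names are only illustrative. It also runs the regex from the question for comparison.

```python
import re

# Sample text from the question: four <a> groups, the third contains <b>BBzz</b>.
text = (
    "<a>\nsdfsdf\n<b>DDzz</b>\nsdfsdf\n</a>\n"
    "<a>\nsdfsdf\n<b>DDzz</b>\nsdfsdf\n</a>\n"
    "<a>\nsdfsdf\n<b>BBzz</b>\nsdfsdf\n</a>\n"
    "<a>\nsdfsdf\n<b>DDzz</b>\nsdfsdf\n</a>\n"
)

# Tempered greedy token: each char is consumed only if it does not start
# "<b>BB", "<a>" or "</a>". re.DOTALL lets "." match newlines.
tempered = re.compile(r"<a>(?:(?!<(?:b>BB|/?a>)).)*</a>", re.DOTALL)

# Regex from the question, kept for comparison.
original = re.compile(r"<a>.*?<b>(?!B).*?</b>.*?</a>", re.DOTALL)

tempered_matches = tempered.findall(text)
original_matches = original.findall(text)

for i, m in enumerate(tempered_matches, 1):
    print(f"{i}.\n{m}\n")
```

With the sample text, the tempered pattern yields three matches, none containing <b>BB. The regex from the question also yields three, but its last match runs from the BBzz group into the following one: the lazy .*? is free to expand past </a> and <a> until it finds a <b> that is not followed by B.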

Upvotes: 3
