Reputation: 29683
I have text:
<a>
sdfsdf
<b>DDzz</b>
sdfsdf
</a>
<a>
sdfsdf
<b>DDzz</b>
sdfsdf
</a>
<a>
sdfsdf
<b>BBzz</b>
sdfsdf
</a>
<a>
sdfsdf
<b>DDzz</b>
sdfsdf
</a>
I can't parse it as XML, so I need to use a regex here. This is also only an example.
I want a regex that matches every <a>...</a> group that does not contain a b element whose text starts with BB.
I came up with this regex:
<a>.*?<b>(?!B).*?</b>.*?</a>
But it matches the last group as:
<a>
sdfsdf
<b>BBzz</b>
sdfsdf
</a>
<a>
sdfsdf
<b>DDzz</b>
sdfsdf
</a>
That is not what I want.
How do I write a regex that matches only these 3 groups from my example?
1.
<a>
sdfsdf
<b>DDzz</b>
sdfsdf
</a>
2.
<a>
sdfsdf
<b>DDzz</b>
sdfsdf
</a>
3.
<a>
sdfsdf
<b>DDzz</b>
sdfsdf
</a>
Upvotes: 1
Views: 83
Reputation: 627126
Use a tempered greedy token regex:
<a>(?:(?!<(?:b>BB|/?a>)).)*</a>
Enable the ". matches newline" option (often called DOTALL or single-line mode) so that . also matches line breaks.
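As a minimal sketch of how this might look in practice, assuming Python's re module (the question does not name a regex flavor), the ". matches newline" option corresponds to the re.DOTALL flag. The names TEXT, PATTERN and find_groups are illustrative only; the sample input is copied from the question.

```python
import re

# Sample input copied from the question: the third group contains <b>BBzz</b>.
TEXT = """<a>
sdfsdf
<b>DDzz</b>
sdfsdf
</a>
<a>
sdfsdf
<b>DDzz</b>
sdfsdf
</a>
<a>
sdfsdf
<b>BBzz</b>
sdfsdf
</a>
<a>
sdfsdf
<b>DDzz</b>
sdfsdf
</a>"""

# re.DOTALL is Python's ". matches newline" option.
PATTERN = re.compile(r"<a>(?:(?!<(?:b>BB|/?a>)).)*</a>", re.DOTALL)


def find_groups(text):
    """Return every <a>...</a> group that has no <b> whose text starts with BB."""
    # The pattern has no capturing groups, so findall returns the full matches.
    return PATTERN.findall(text)


matches = find_groups(TEXT)
print(len(matches))  # prints 3
```

Because the tempered token also refuses to step over <a> and </a>, a match can never run from one group into the next, which is what happened with the lazy .*? version in the question.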
Details:
<a> - a literal <a> char sequence
(?:(?!<(?:b>BB|/?a>)).)* - a tempered greedy token matching any char (.) that is not the starting symbol of a sequence that can be matched with the pattern inside the (?!<(?:b>BB|/?a>)) lookahead (not a <b>BB or </a> or <a> sequence)
</a> - a literal </a> char sequence
Upvotes: 3