Reputation: 639
I have this code:
import re
a = r'<b>1234</b><b>56text78</b><b>9012</b>'
print(re.search(r'<b>.*?text.*?</b>', a).group())
and I am trying to match a minimal block between '<b>' and '</b>' which contains 'text' anywhere in between. This code is the best I could come up with, but it matches:
<b>1234</b><b>56text78</b>
while I need:
<b>56text78</b>
Upvotes: 1
Views: 80
Reputation: 639
Why doesn't '<b>.*?text' produce the desired output?
This is what the regexp engine does:
1. It takes the first character of the pattern, '<', and finds it in the string, then takes the second, then the third, until it matches '<b>'.
2. It takes the '.*?text' pattern and tries to find it in the string. That's because '.*?' without the 'text' part would make no sense, as it would match 0 characters. It matches the '1234</b><b>56text' part and adds it to the '<b>' found in step 1.
It actually does produce a non-greedy output, it's just non-obvious in this case. If the string was:
`<b>1234</b><b>56text78text</b><b>9012</b>`
then the greedy '<b>.*text' match would be:
<b>1234</b><b>56text78text
and the non-greedy one, '<b>.*?text', would produce the one I was getting:
<b>1234</b><b>56text
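The difference between the two quantifiers can be checked directly. This is a small sketch using the modified string from above; the variable names are only illustrative.

```python
import re

# The modified string from the answer, with 'text' appearing twice.
s = '<b>1234</b><b>56text78text</b><b>9012</b>'

# Greedy: '.*' grabs as much as possible, so the match ends at the LAST 'text'.
greedy = re.search(r'<b>.*text', s).group()

# Non-greedy: '.*?' grabs as little as possible, so the match ends at the FIRST 'text'.
lazy = re.search(r'<b>.*?text', s).group()

print(greedy)  # <b>1234</b><b>56text78text
print(lazy)    # <b>1234</b><b>56text
```

Both matches start at the very first '<b>'. The quantifier only controls where the match ends, not where it begins.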
So, to answer the initial question, the correct solution is to exclude the '<' and '>' characters from the part of the match before 'text':
import re
a = r'<b>1234</b><b>56text78</b><b>9012</b>'
print(re.search(r'<b>[^<>]*text.*?</b>', a).group())
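For comparison, here is a short sketch that runs the pattern from the question and the pattern with the negated character class on the same input. The names 'original' and 'fixed' are just labels for this example.

```python
import re

a = '<b>1234</b><b>56text78</b><b>9012</b>'

# Pattern from the question: '.*?' is allowed to run across '</b><b>'.
original = re.search(r'<b>.*?text.*?</b>', a).group()

# Negated class: nothing between '<b>' and 'text' may be '<' or '>',
# so an attempt starting at the first '<b>' fails and the engine moves on.
fixed = re.search(r'<b>[^<>]*text.*?</b>', a).group()

print(original)  # <b>1234</b><b>56text78</b>
print(fixed)     # <b>56text78</b>
```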
Upvotes: 0
Reputation: 174844
Why are you getting '<b>1234</b><b>56text78</b>' as the output when using the regex '<b>.*?text.*?</b>'?
The regex engine scans the input from left to right. It first takes the pattern '<b>' from the regex and tries to match it against the input string. As soon as it finds the tag '<b>', it matches that tag. The engine then takes the next part of the pattern together with the literal that follows it, that is '.*?text', and matches any characters up to the first occurrence of 'text'. I say "first" because if there is more than one 'text' after '<b>', '.*?text' matches only up to the first one. So '<b>1234</b><b>56text' is matched. Finally the engine takes the last part, '.*?</b>', and matches up to the first '</b>', so '<b>1234</b><b>56text78</b>' is matched.
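The walk-through above can be reproduced by searching with growing prefixes of the pattern. This is only an illustration of the result at each stage, not a trace of the engine's internals; the list name 'stages' is made up for the example.

```python
import re

a = '<b>1234</b><b>56text78</b><b>9012</b>'

# Growing prefixes of the pattern from the question.
stages = [r'<b>', r'<b>.*?text', r'<b>.*?text.*?</b>']

# Record the matched text and where each match starts.
results = []
for pattern in stages:
    m = re.search(pattern, a)
    results.append((m.group(), m.start()))
    print(pattern, '->', m.group(), '(starts at index', str(m.start()) + ')')
```

Every stage starts at index 0: the leftmost '<b>' that can lead to a match wins, and the lazy quantifiers only keep the end of the match as early as possible.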
With the regex '<b>[^<]*text[^<]*</b>', the characters between '<b>' and 'text', and between 'text' and '</b>', can be anything except '<'. This prevents the engine from matching across tags.
Upvotes: 0
Reputation: 945
Instead of '.*?', use '[^<]*':
print(re.search(r'<b>[^<]*text[^<]*</b>', a).group())
Here '[^<]*' matches any characters except "<", so the match cannot run past a tag.
Upvotes: 2