Reputation: 372
Say I have two types of strings:
str1 = 'NUM-140 A Thing: Foobar Analysis NUM-140'
str2 = 'NUM-140 Foobar Analysis NUM-140'
For both of these, I want to match 'Foobar' (which could be anything). I have tried the following:
import re
m = re.compile(r'((?<=Thing: ).+(?= Analysis))|((?<=\d ).+(?= Analysis))')
ind1 = m.search(str1).span()
match1 = str1[ind1[0]:ind1[1]]
ind2 = m.search(str2).span()
match2 = str2[ind2[0]:ind2[1]]
However, match1 comes out to 'A Thing: Foobar', which seems to be the match for the second pattern, not the first. Applied individually (pattern 1 to str1 and pattern 2 to str2, without the |), both patterns match 'Foobar'. I therefore expected the combined pattern to stop as soon as the first alternative matched. This doesn't seem to be the case. What am I missing?
Upvotes: 2
Views: 361
Reputation: 2562
According to the documentation,
As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy.
But the behavior seems to be different:
import re
THING = r'(?<=Thing: )(?P<THING>.+)(?= Analysis)'
NUM = r'(?<=\d )(?P<NUM>.+)(?= Analysis)'
MIXED = THING + '|' + NUM
str1 = 'NUM-140 A Thing: Foobar Analysis NUM-140'
str2 = 'NUM-140 Foobar Analysis NUM-140'
# re.search is needed here: re.match only tries index 0, where neither lookbehind can succeed
print(re.search(THING, str1))
# <... match='Foobar'>
print(re.search(NUM, str1))
# <... match='A Thing: Foobar'>
print(re.search(MIXED, str1))
# <... match='A Thing: Foobar'>
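We might expect that because THING matches 'Foobar', the MIXED pattern would return that 'Foobar' and stop searching. However, the left-to-right rule in the documentation applies at each starting position in turn: the scan reaches index 8 first, where only NUM can match, so that branch is accepted before THING is ever tried at index 17. To prefer THING regardless of where it matches, the solution can rely on Python's or short-circuiting: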
print(re.search(THING, str1) or re.search(NUM, str1))
# <_sre.SRE_Match object; span=(17, 23), match='Foobar'>
print(re.search(THING, str2) or re.search(NUM, str2))
# <_sre.SRE_Match object; span=(8, 14), match='Foobar'>
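The position effect can be checked directly. The snippet below is only a rough sketch: the toy patterns 'a|ab' and 'b|ab' and the helper name first_match are made up for illustration, and the helper simply generalizes the or-chain above to any number of patterns in priority order.

```python
import re

THING = r'(?<=Thing: )(?P<THING>.+)(?= Analysis)'
NUM = r'(?<=\d )(?P<NUM>.+)(?= Analysis)'

str1 = 'NUM-140 A Thing: Foobar Analysis NUM-140'
str2 = 'NUM-140 Foobar Analysis NUM-140'

# Same start position: the left branch wins even though the right one is longer.
same_start = re.search(r'a|ab', 'ab').group()

# Different start positions: index 0 is tried first, and only the right
# branch can match there, so the left branch never gets its turn.
earlier_start = re.search(r'b|ab', 'ab').group()

# In str1 the NUM branch can start earlier than the THING branch.
num_start = re.search(NUM, str1).start()
thing_start = re.search(THING, str1).start()


def first_match(patterns, text):
    """Hypothetical helper: return the match of the first pattern that matches."""
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match
    return None


print(same_start, earlier_start, num_start, thing_start)
print(first_match([THING, NUM], str1).group())
print(first_match([THING, NUM], str2).group())
```

Each pattern scans the whole string separately here, which should be fine for short strings like these.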
Upvotes: 1
Reputation: 48577
If you use named groups, e.g. (?P<name>...), you'll be able to debug more easily. But note the docs for span:
https://docs.python.org/2/library/re.html#re.MatchObject.span
span([group]) For MatchObject m, return the 2-tuple (m.start(group), m.end(group)). Note that if group did not contribute to the match, this is (-1, -1). group defaults to zero, the entire match.
You're not passing in the group number.
Why are you using span anyway? Just use m.search(str1).groups() or similar.
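As a rough illustration of how named groups help with debugging, the sketch below reuses the group names THING and NUM from the other answer (an assumption, they are not in the question's pattern) and inspects which branch of the alternation actually produced the match.

```python
import re

# Both alternatives from the question, each wrapped in a named group.
MIXED = re.compile(
    r'(?<=Thing: )(?P<THING>.+)(?= Analysis)'
    r'|(?<=\d )(?P<NUM>.+)(?= Analysis)'
)

str1 = 'NUM-140 A Thing: Foobar Analysis NUM-140'
match = MIXED.search(str1)

winning_branch = match.lastgroup   # name of the group that matched
captures = match.groupdict()       # groups that did not take part are None
thing_span = match.span('THING')   # (-1, -1) when the group did not contribute
num_span = match.span('NUM')

print(winning_branch, captures, thing_span, num_span)
```

Here lastgroup and groupdict() make it visible that the second alternative matched, which is harder to see from span() with no argument.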
Upvotes: 0