dieggsy

Reputation: 372

Python regular expression with or and re.search

Say I have two types of strings:

str1 = 'NUM-140 A Thing: Foobar Analysis NUM-140'
str2 = 'NUM-140 Foobar Analysis NUM-140'

For both of these, I want to match 'Foobar' (which could be anything). I have tried the following:

m = re.compile(r'((?<=Thing: ).+(?= Analysis))|((?<=\d ).+(?= Analysis))')

ind1 = m.search(str1).span()
match1 = str1[ind1[0]:ind1[1]]

ind2 = m.search(str2).span()
match2 = str2[ind2[0]:ind2[1]]

However, match1 comes out to 'A Thing: Foobar', which seems to be the match for the second pattern, not the first. Applied individually (pattern 1 to str1 and pattern 2 to str2, without the |), both patterns match 'Foobar', so I expected the combined pattern to stop as soon as the first alternative matched. This doesn't seem to be the case. What am I missing?

Upvotes: 2

Views: 361

Answers (2)

chapelo

Reputation: 2562

According to the documentation,

As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy.

But the behavior seems to be different:

import re

THING = r'(?<=Thing: )(?P<THING>.+)(?= Analysis)'
NUM = r'(?<=\d )(?P<NUM>.+)(?= Analysis)'
MIXED = THING + '|' + NUM

str1 = 'NUM-140 A Thing: Foobar Analysis NUM-140'
str2 = 'NUM-140 Foobar Analysis NUM-140'

# re.search is needed here: re.match only tries position 0, where both lookbehinds fail
print(re.search(THING, str1))
# <... match='Foobar'>
print(re.search(NUM, str1))
# <... match='A Thing: Foobar'>
print(re.search(MIXED, str1))
# <... match='A Thing: Foobar'>
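
A quick check of where each pattern first matches helps explain this result. The following is only a minimal sketch that reuses the patterns above; the variable names thing_start, num_start and winning_branch are introduced here for illustration, and lastgroup reports the name of the group that took part in the match.

```python
import re

THING = r'(?<=Thing: )(?P<THING>.+)(?= Analysis)'
NUM = r'(?<=\d )(?P<NUM>.+)(?= Analysis)'
MIXED = THING + '|' + NUM

str1 = 'NUM-140 A Thing: Foobar Analysis NUM-140'

# Start index of the first match of each pattern when used on its own.
thing_start = re.search(THING, str1).start()
num_start = re.search(NUM, str1).start()

# For the combined pattern, lastgroup names the alternative that matched.
mixed = re.search(MIXED, str1)
winning_branch = mixed.lastgroup

print(thing_start, num_start, winning_branch, repr(mixed.group()))
# 17 8 NUM 'A Thing: Foobar'
```

On str1, NUM can match starting at index 8 while THING can only match starting at index 17, so a scan from the left reaches the NUM alternative's match first.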

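We might expect that, because THING matches 'Foobar', the MIXED pattern would return that 'Foobar' and stop searching. However, the left-to-right rule in the documentation applies to the alternatives tried at a single starting position. re.search scans the string from the left, and at index 8 (just after '140 ') THING fails while NUM succeeds, so the match is accepted there before the scan ever reaches index 17, where THING would match.

So this is the documented behavior: the leftmost starting position wins, regardless of the order of the alternatives. To give THING priority wherever it matches, one solution is to run the patterns separately and rely on Python's or short-circuiting: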

print(re.search(THING, str1) or re.search(NUM, str1))
# <_sre.SRE_Match object; span=(17, 23), match='Foobar'>

print(re.search(THING, str2) or re.search(NUM, str2))
# <_sre.SRE_Match object; span=(8, 14), match='Foobar'>
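
If more than two patterns are involved, the same idea can be wrapped in a small helper. This is only a sketch: search_first is a hypothetical name, not part of the re module, and it assumes the patterns are passed in priority order.

```python
import re

def search_first(patterns, text):
    """Return the match of the first pattern (in priority order) that matches, else None."""
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match
    return None

THING = r'(?<=Thing: )(?P<THING>.+)(?= Analysis)'
NUM = r'(?<=\d )(?P<NUM>.+)(?= Analysis)'

str1 = 'NUM-140 A Thing: Foobar Analysis NUM-140'
str2 = 'NUM-140 Foobar Analysis NUM-140'

# THING is tried over the whole string first; NUM is only a fallback.
results = [search_first([THING, NUM], s).group() for s in (str1, str2)]
print(results)
# ['Foobar', 'Foobar']
```

Unlike a single alternation, this tries each pattern against the whole string before moving on to the next one, at the cost of scanning the string once per pattern.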

Upvotes: 1

Falmarri

Reputation: 48577

If you use named groups, e.g. (?P<name>...), debugging becomes easier. Also note the docs for span:

https://docs.python.org/2/library/re.html#re.MatchObject.span

span([group]) For MatchObject m, return the 2-tuple (m.start(group), m.end(group)). Note that if group did not contribute to the match, this is (-1, -1). group defaults to zero, the entire match.

You're not passing in the group number, so span() describes the entire match (group 0) rather than a particular alternative.

Why are you using span anyway? Just use m.search(str1).groups() or similar.
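
As an illustration of the points above, the sketch below uses the question's second string with named groups added to the pattern (the group names THING and NUM are chosen here for illustration). It compares slicing by span() with group(), and shows what an alternative that did not take part in the match reports.

```python
import re

pattern = re.compile(
    r'(?<=Thing: )(?P<THING>.+)(?= Analysis)'
    r'|(?<=\d )(?P<NUM>.+)(?= Analysis)'
)
str2 = 'NUM-140 Foobar Analysis NUM-140'

m = pattern.search(str2)

# Slicing by span() and calling group() give the same text for the whole match.
start, end = m.span()
sliced = str2[start:end]
whole = m.group()

# groupdict() shows which named alternative actually matched.
by_name = m.groupdict()

# A group that did not contribute to the match has span (-1, -1).
unused_span = m.span('THING')

print(sliced, whole, by_name, unused_span)
# Foobar Foobar {'THING': None, 'NUM': 'Foobar'} (-1, -1)
```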

Upvotes: 0
