Reputation: 381
I've gone through the HOWTO and the re module docs several times, and I'm still confused about how optionality and grouping interact in Python regexes. What I want is to match everything inside a group, or nothing at all, but I'm finding that substrings are matching. Here's a minimal example:
>> re.compile(r"(test)?").search("tes")
<_sre.SRE_Match at 0xBlahBlah>
I expected that not to match, since I have the entire string test marked as optional. What (part of the docs) am I not understanding?
A version of the problem that's closer to what I'm actually interested in is as follows:
>> re.compile(r"(distance|mileage)(\sbetween)?").search("distancebetween")
<_sre.SRE_Match at 0xBlahblah>
Why is that whitespace not being forced to match?
EDIT 2017-01-04 The answers thus far are helpful, but I think I didn't explain my need sufficiently clearly.
In short, I want a regex that will match foo or bar (in their entirety) or foo baz or bar baz (in their entirety) and nothing else.
>> m = re.compile(r"(foo|bar)(\sbaz)?")
>> m.search("foo ba")
<_sre.SRE_Match at 0xBlahblah>
>> m.search("foo ba").span()
(0, 3)
So I see that what's happening is that it's matching on foo and then not caring about what's further downstream. How do I get it to match only on baz or nothing at all?
Upvotes: 2
Views: 454
Reputation: 647
With the ? in both cases, you're saying you want either 0 or 1 occurrences of the group. So with "(test)?" you either match "test", which doesn't match here, or an empty string, which is found at the very start of the string.
In the second one, "(distance|mileage)(\sbetween)?", there are four possible matches: "distance", "mileage", "distance between", or "mileage between".
None of these has to be the whole string, though, so there can be text before or after the match. To change that, use ^regex if you only want to match at the start, regex$ to match only at the end, or ^regex$ to match only the whole string.
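As a rough sketch of the whole-string idea applied to the foo/bar case from the question's edit: the pattern below is the one from the question, and the helper name matches_whole is made up for illustration. It assumes Python 3.4 or later, where compiled patterns have a fullmatch method; on older versions, wrapping the pattern as ^(?:...)$ should behave much the same.

```python
import re

# Pattern copied from the question's edit. fullmatch only succeeds when the
# pattern consumes the entire string, much like anchoring it with ^ and $.
pattern = re.compile(r"(foo|bar)(\sbaz)?")


def matches_whole(text):
    """Hypothetical helper: True only if the whole string matches."""
    return pattern.fullmatch(text) is not None


for candidate in ["foo", "bar", "foo baz", "bar baz", "foo ba", "foobaz"]:
    print(repr(candidate), matches_whole(candidate))
```

With this, "foo ba" is rejected because the optional group cannot match " ba", and the leftover text means the whole string is not consumed.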
Upvotes: 1
Reputation: 9066
For what you're describing I don't think you want to use an optional match. I think you want exactly the regexes you have but without the ?.
For your first example:
>>> re.compile(r"(test)").search("tes")
>>> re.compile(r"(test)").search("test")
<_sre.SRE_Match object at 0x104c64210>
>>> re.compile(r"(test)").search("testing")
<_sre.SRE_Match object at 0x104c64198>
For your second example:
>>> re.compile(r"(distance|mileage)(\sbetween)").search("distancebetween")
>>> re.compile(r"(distance|mileage)(\sbetween)").search("distance between")
<_sre.SRE_Match object at 0x104bf5608>
>>> re.compile(r"(distance|mileage)(\sbetween)").search("distance ")
Upvotes: 1
Reputation: 8097
Let's look at what is matched:
import re
m = re.compile(r"(test)?").search("tes")
m.span()
# have (0, 0)
It's an empty string. Why? Because ? here means zero or one time (just like {0,1}). So the first group can match either the string test or the empty string (which is what we get here).
Here is a quote from the docs:
'?'
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.
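To make the zero-or-one behaviour concrete, here is a small sketch that inspects what the optional group captured. The variable names are only illustrative, and the commented results assume standard CPython re behaviour.

```python
import re

optional = re.compile(r"(test)?")

# "tes" cannot satisfy the group, so the group is skipped (zero repetitions)
# and the overall match is the empty string at position 0.
m = optional.search("tes")
print(m.span())    # (0, 0)
print(m.group(1))  # None, because the group did not participate

# With the full word present, the group matches once.
m2 = optional.search("test")
print(m2.span())    # (0, 4)
print(m2.group(1))  # test
```

Checking m.group(1) is None is one way to tell "the optional group was skipped" apart from "the group matched".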
Upvotes: 4