Reputation: 50550
I have the following regex, which should capture three groups:
^(ser-num.*|\[ser-num.*])(?: )?(\w+)?(?: )?(http://.*\.com/(?:s(?:erial)?|p(?:roduct)?)/\d+(?:/)?(?:\d+|(?:\w|-)+)?)
I test it against these two strings:
strings = [
"ser-num recommend http://example.com/s/123456 ",
"ser-num http://example.com/s/123456 ",
]
When I run these against the RegEx I receive the following groups:
('ser-num recommend ', None, 'http://example.com/s/123456')
('ser-num ', None, 'http://example.com/s/123456')
Why is my first result combining "recommend" into group \1 instead of \2?
This is my entire example script:
import re

# Raw string so that backslashes reach the regex engine unchanged.
p = re.compile(r"""^(ser-num.*|\[ser-num.*])(?: )?(\w+)?(?: )?(http://.*\.com/(?:s(?:erial)?|p(?:roduct)?)/\d+(?:/)?(?:\d+|(?:\w|-)+)?)""")

strings = [
    "ser-num recommend http://example.com/s/123456 ",
    "ser-num http://example.com/s/123456 ",
]

for s in strings:
    m = p.match(s)
    if m is None:
        # match() returns None when the pattern does not match.
        print("Not a match for %s" % s)
    else:
        print(m.groups())
The explanation of my RegEx says that the optional group \2 does exist.
Update based on comments:
If I use this regex instead:
^(ser-num.*|\[ser-num.*])\s?(\w*)\s?(http://.*\.com/(?:s(?:erial)?|p(?:roduct)?)/\d+(?:/)?(?:\d+|(?:\w|-)+)?)
I receive these results (notice the empty strings instead of None in group \2):
('ser-num recommend ', '', 'http://example.com/s/123456')
('ser-num ', '', 'http://example.com/s/123456')
Upvotes: 0
Views: 39
Reputation: 168626
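The word "recommend" is part of the first group because it matches the subpattern ser-num.* in that group. The star operator is greedy: it takes the longest match that still lets the rest of the pattern succeed, and since group \2 is optional, nothing forces it to give "recommend" back. If you want the shortest possible match, use the non-greedy form *? instead.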
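A minimal sketch of the difference between the two quantifiers, run against the first string from the question. To keep it short, the URL part is simplified to http://\S+, which is an assumption for illustration and not the pattern from the question.

```python
import re

s = "ser-num recommend http://example.com/s/123456 "

# Greedy: .* first grabs the rest of the line, then gives characters back
# only until "http://" can match, so "recommend " stays inside group 1.
# (URL part simplified to http://\S+ for illustration.)
greedy = re.match(r"^(ser-num.*)(?: )?(\w+)?(?: )?(http://\S+)", s)

# Lazy: .*? starts empty and grows only when the rest of the pattern fails,
# so the optional (\w+) gets the chance to capture "recommend".
lazy = re.match(r"^(ser-num.*?)(?: )?(\w+)?(?: )?(http://\S+)", s)

print(greedy.groups())  # ('ser-num recommend ', None, 'http://example.com/s/123456')
print(lazy.groups())    # ('ser-num', 'recommend', 'http://example.com/s/123456')
```

The only change between the two patterns is the ? after the star in group 1.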
Try this:
p = re.compile("""^(ser-num.*?|\[ser-num.*?])(?: )?(\w+)?(?: )?(http://.*\.com/(?:s(?:erial)?|p(?:roduct)?)/\d+(?:/)?(?:\d+|(?:\w|-)+)?)""")
Note the use of the non-greedy star: ser-num.*?
Reference: see the entry for *?, +?, ?? here: https://docs.python.org/2/library/re.html#regular-expression-syntax
Upvotes: 1
Reputation: 1927
I suggest the following regexp:
^(\[?ser-num\S*]?)\s*(\w*)\s*(http://.*\.com/(?:s(?:erial)?|p(?:roduct)?)/\d+(?:/)?(?:\d+|(?:\w|-)+)?)
This (especially the \S* in place of the .*) forces (\w*) to be in its own capture group instead of being gobbled up by the greedy any-character ser-num.* in the first group. Note that you also got the extra spaces in this first group for the same reason, i.e., they were greedily captured instead of being discarded by the optional-space parts of the pattern.
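As a quick check, here is a sketch that runs the suggested pattern against the two strings from the question. The variable names are only illustrative.

```python
import re

# The pattern suggested above, split over two raw strings for readability.
pattern = re.compile(
    r"^(\[?ser-num\S*]?)\s*(\w*)\s*"
    r"(http://.*\.com/(?:s(?:erial)?|p(?:roduct)?)/\d+(?:/)?(?:\d+|(?:\w|-)+)?)"
)

strings = [
    "ser-num recommend http://example.com/s/123456 ",
    "ser-num http://example.com/s/123456 ",
]

results = [pattern.match(s).groups() for s in strings]
for groups in results:
    print(groups)
# ('ser-num', 'recommend', 'http://example.com/s/123456')
# ('ser-num', '', 'http://example.com/s/123456')
```

Because group 2 is written as (\w*) and not (\w+)?, it always participates in the match and yields an empty string when no word is present. If None is preferred in that case, m.group(2) or None is one way to normalize it.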
Upvotes: 2