Andy

Reputation: 50550

RegEx grouping is not what I expected

I have the following regex, which should capture three groups:

^(ser-num.*|\[ser-num.*])(?: )?(\w+)?(?: )?(http://.*\.com/(?:s(?:erial)?|p(?:roduct)?)/\d+(?:/)?(?:\d+|(?:\w|-)+)?)

I test it against these two strings:

strings = [
    "ser-num recommend http://example.com/s/123456 ",
    "ser-num http://example.com/s/123456 ",
]

When I run these against the regex, I get the following groups:

('ser-num recommend ', None, 'http://example.com/s/123456')
('ser-num ', None, 'http://example.com/s/123456')

Why is my first result combining "recommend" into group \1 instead of \2?

This is my entire example script:

import re

# Raw string so the backslashes reach the regex engine unchanged.
p = re.compile(r"""^(ser-num.*|\[ser-num.*])(?: )?(\w+)?(?: )?(http://.*\.com/(?:s(?:erial)?|p(?:roduct)?)/\d+(?:/)?(?:\d+|(?:\w|-)+)?)""")

strings = [
    "ser-num recommend http://example.com/s/123456 ",
    "ser-num http://example.com/s/123456 ",
]

for s in strings:
    m = p.match(s)
    if m:
        print(m.groups())
    else:
        # match() returns None when the pattern does not match
        print("Not a match for %s" % s)

The explanation of my RegEx says that the optional group \2 does exist.

Update based on comments:

If I use this regex instead:

^(ser-num.*|\[ser-num.*])\s?(\w*)\s?(http://.*\.com/(?:s(?:erial)?|p(?:roduct)?)/\d+(?:/)?(?:\d+|(?:\w|-)+)?)

I get these results (notice the empty strings instead of None in group \2):

('ser-num recommend ', '', 'http://example.com/s/123456')
('ser-num ', '', 'http://example.com/s/123456')

Upvotes: 0

Views: 39

Answers (2)

Robᵩ

Reputation: 168626

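The word recommend ends up in the first group because it matches the sub-pattern ser-num.*. The star operator is greedy: it consumes as much as it can and gives characters back only when the rest of the pattern would otherwise fail. Everything between the first group and the URL is optional, so nothing forces .* to give back recommend. If you want the shortest possible match, use *?.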

Try this:

p = re.compile("""^(ser-num.*?|\[ser-num.*?])(?: )?(\w+)?(?: )?(http://.*\.com/(?:s(?:erial)?|p(?:roduct)?)/\d+(?:/)?(?:\d+|(?:\w|-)+)?)""")

Note the use of the non-greedy star: ser-num.*?
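To see the difference side by side, here is a minimal sketch that runs the original greedy pattern and the non-greedy version against the two strings from the question. The names URL, greedy and lazy are only for illustration, and the URL part of the pattern is copied unchanged.

```python
import re

# URL part of the pattern, unchanged from the question.
URL = r"(http://.*\.com/(?:s(?:erial)?|p(?:roduct)?)/\d+(?:/)?(?:\d+|(?:\w|-)+)?)"

# Greedy first group (original) vs. non-greedy first group (suggested fix).
greedy = re.compile(r"^(ser-num.*|\[ser-num.*])(?: )?(\w+)?(?: )?" + URL)
lazy = re.compile(r"^(ser-num.*?|\[ser-num.*?])(?: )?(\w+)?(?: )?" + URL)

strings = [
    "ser-num recommend http://example.com/s/123456 ",
    "ser-num http://example.com/s/123456 ",
]

for s in strings:
    print("greedy:", greedy.match(s).groups())
    print("lazy:  ", lazy.match(s).groups())
```

With the non-greedy star, the first string should give ('ser-num', 'recommend', 'http://example.com/s/123456') and the second ('ser-num', None, 'http://example.com/s/123456'). The separating space is no longer part of group \1 either.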


Upvotes: 1

abiessu

Reputation: 1927

I suggest the following regexp:

^(\[?ser-num\S*]?)\s*(\w*)\s*(http://.*\.com/(?:s(?:erial)?|p(?:roduct)?)/\d+(?:/)?(?:\d+|(?:\w|-)+)?)

This (especially the \S* in place of the .*) forces (\w*) into its own capture group instead of being gobbled up by the greedy ser-num.* group, because \S* stops at the first whitespace character. Note that the extra space at the end of your first group has the same cause: it was captured greedily by .* instead of being matched, and discarded, by the optional space.
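As a quick check, this sketch runs the suggested regexp against the two strings from the question. The variable names are only for illustration.

```python
import re

# Suggested pattern, split across two raw strings for readability.
pattern = re.compile(
    r"^(\[?ser-num\S*]?)\s*(\w*)\s*"
    r"(http://.*\.com/(?:s(?:erial)?|p(?:roduct)?)/\d+(?:/)?(?:\d+|(?:\w|-)+)?)"
)

strings = [
    "ser-num recommend http://example.com/s/123456 ",
    "ser-num http://example.com/s/123456 ",
]

results = [pattern.match(s).groups() for s in strings]
for groups in results:
    print(groups)
```

This should print ('ser-num', 'recommend', 'http://example.com/s/123456') and ('ser-num', '', 'http://example.com/s/123456'). Group \2 is an empty string rather than None in the second case because (\w*) always takes part in the match, even when it matches nothing. If you prefer None there, one option is to convert it afterwards with groups[1] or None.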

Upvotes: 2
