Dave

Reputation: 3924

How to return the regex that matches some text?

The answer to the JavaScript regex question "Return the part of the regex that matched" is "No, because compilation destroys the relationship between the regex text and the matching logic."

But Python preserves Match objects, and m.groups() shows which group(s) took part in a match. It should be simple to preserve the regex text of each group as part of a Match object and return it, but there doesn't appear to be a call to do so.

import re

pat = r"(^\d+$)|(^\w+$)|(^\W+$)"
test = ['a', 'c3', '36d', '51', '29.5', '#$%&']
for t in test:
    m = re.search(pat, t)
    s = (m.lastindex, m.groups()) if m else ''
    print(str(bool(m)), s)

This returns:

True (2, (None, 'a', None))
True (2, (None, 'c3', None))
True (2, (None, '36d', None))
True (1, ('51', None, None))
False
True (3, (None, None, '#$%&'))

The compiler obviously knows that there are three groups in this pattern. Is there a way to extract the subpattern in each group in a regex, with something like:

>>> print(m.regex_group_text)

('^\d+$', '^\w+$', '^\W+$')

Yes, it would be possible to write a custom pattern parser, for example to split on '|' for this particular pattern. But it would be far easier and more reliable to use the re compiler's understanding of the text in each group.

Upvotes: 5

Views: 136

Answers (3)

jbndlr

Reputation: 5210

If the indices are not sufficient and you absolutely need to know the exact part of the regex, there is probably no other possibility but to parse the expression's groups on your own.

All in all, this is no big deal, since you can simply count opening and closing brackets and log their indices:

def locateBraces(inp):
    bracePositions = []
    braceStack = []  # indices of '(' that are still open
    for i, ch in enumerate(inp):
        if ch == '(':
            braceStack.append(i)
        elif ch == ')':
            # Check before popping; an unmatched ')' would otherwise raise IndexError.
            if not braceStack:
                raise SyntaxError('Too many closing braces.')
            bracePositions.append((braceStack.pop(), i))
    if braceStack:
        raise SyntaxError('Too many opening braces.')
    return bracePositions

Edited: This naive implementation only counts opening and closing braces. However, regexes may contain escaped braces, e.g. \(, which this method wrongly counts as group-defining braces. You may want to adapt it to skip braces preceded by an odd number of backslashes. I leave this issue as a task for you ;)
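For what it's worth, here is a rough sketch of one way to handle the escaped-brace caveat. The name locate_groups is made up for this illustration. It skips any character that follows a backslash, ignores non-capturing constructs such as (?:...) and lookarounds, and sorts the spans by opening brace so that they line up with the group numbers re assigns. It is still not a real regex parser: a brace inside a character class such as [(] would be miscounted.

```python
def locate_groups(pattern):
    """Return (start, end) index pairs of capturing groups, in group-number order.

    Hypothetical helper: handles backslash escapes and skips non-capturing
    '(?...)' constructs, but does NOT understand character classes like '[(]'.
    """
    spans = []
    stack = []  # indices of '(' that are still open
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == '\\':
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == '(':
            stack.append(i)
        elif ch == ')':
            if not stack:
                raise SyntaxError('Too many closing braces.')
            start = stack.pop()
            # '(?P<name>...' captures; every other '(?...' form does not.
            is_special = pattern.startswith('(?', start)
            is_named = pattern.startswith('(?P<', start)
            if is_named or not is_special:
                spans.append((start, i))
        i += 1
    if stack:
        raise SyntaxError('Too many opening braces.')
    # re numbers groups by the position of their opening brace.
    return sorted(spans)
```

For the pattern in the question it returns the same three spans as locateBraces, so the example that follows works with either function.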

With this function, your example becomes:

pat = r"(^\d+$)|(^\w+$)|(^\W+$)"
bloc = locateBraces(pat)

test = ['a', 'c3', '36d', '51', '29.5', '#$%&']
for t in test:
    m = re.search(pat, t)
    print(str(bool(m)), end='')
    if m:
        h = bloc[m.lastindex - 1]
        print(' %s' % (pat[h[0]:h[1] + 1]))
    else:
        print()

Which returns:

True (^\w+$)
True (^\w+$)
True (^\w+$)
True (^\d+$)
False
True (^\W+$)

Edited: To get the list of your groups, of course a simple comprehension would do:

gtxt = [pat[b[0]:b[1] + 1] for b in bloc]

Upvotes: 5

user2926055

Reputation: 1991

It will remain up to you to track what regular expressions you are feeding into re.search. Something like:

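import re

patts = {
    'a': r'\d+',
    'b': r'^\w+',
    'c': r'\W+',
}

# Wrap the alternation in a non-capturing group so that ^ and $ anchor
# every branch, not just the first and the last one.
pat = '^(?:' + '|'.join('({})'.format(x) for x in patts.values()) + ')$'
test = ['a', 'c3', '36d', '51', '29.5', '#$%&']
for t in test:
    m = re.search(pat, t)
    if m:
        for g in m.groups():
            for key, regex in patts.items():
                # fullmatch, so that \d+ is not credited for the '3' in 'c3'
                if g and re.fullmatch(regex, g):
                    print("t={} matched regex={} ({})".format(t, key, regex))
                    break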

Upvotes: 2

mgilson

Reputation: 309841

This may or may not be helpful, depending on the problem that you are actually trying to solve, but Python lets you name the groups:

r = re.compile(r'(?P<int>^\d+$)|(?P<word>^\w+$)')

From there, when you have a match, you can inspect the groupdict to see which groups are present:

r.match('foo').groupdict()  # {'int': None, 'word': 'foo'}
r.match('10').groupdict()  # {'int': '10', 'word': None}

Of course, this doesn't tell you the exact regular expression associated with the match -- You'd need to keep track of that yourself based on the group name.
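A minimal sketch of that bookkeeping, assuming you build the combined pattern yourself from a dict of named subpatterns. The names subpatterns, combined and matched_subpattern, and the extra 'symbol' group, are invented for this example. It relies on m.lastgroup, which holds the name of the last matched capturing group; for a flat alternation like this one, that is the branch that matched.

```python
import re

# Hypothetical registry: group name -> regex text of that group.
subpatterns = {
    'int': r'^\d+$',
    'word': r'^\w+$',
    'symbol': r'^\W+$',
}

# Build '(?P<int>^\d+$)|(?P<word>^\w+$)|(?P<symbol>^\W+$)'.
combined = re.compile(
    '|'.join('(?P<{}>{})'.format(name, text) for name, text in subpatterns.items())
)

def matched_subpattern(text):
    """Return (group name, regex text) for the branch that matched, or None."""
    m = combined.search(text)
    if m is None:
        return None
    return m.lastgroup, subpatterns[m.lastgroup]

for t in ['a', 'c3', '36d', '51', '29.5', '#$%&']:
    print(t, matched_subpattern(t))
```

Since the regex text is looked up in your own dict, nothing has to be recovered from the compiled pattern. With nested named groups, lastgroup would need more care.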

If you really want to go beyond this, you probably want something a little more sophisticated than simple regular expression parsing. In that case, I might suggest something like pyparsing. Don't let the old-school styling on the website fool you (or the lack of a PEP-8 compliant API) -- the library is actually quite powerful once you get used to it.

Upvotes: 4
