Tanc

Reputation: 667

Combine multiple regex expressions in Python

For clarity, I was looking for a way to compile multiple regexes at once. For simplicity, let's say that every expression has the format (.*) something (.*). There will be no more than 60 expressions to test.

Following the approach seen here, I ended up writing the following.

import re
re1 = r'(.*) is not (.*)'
re2 = r'(.*) is the same size as (.*)'
re3 = r'(.*) is a word, not (.*)'
re4 = r'(.*) is world know, not (.*)'

sentences = ["foo2 is a word, not bar2"]

for sentence in sentences:
    match = re.compile("(%s|%s|%s|%s)" % (re1, re2, re3, re4)).search(sentence)
    if match is not None:
        print(match.group(1))
        print(match.group(2))
        print(match.group(3))

As the regexes are separated by a pipe, I thought that matching would stop automatically as soon as one rule matched.

Executing the code, I get:

foo2 is a word, not bar2
None
None

But after swapping re3 and re1 in re.compile, i.e. match = re.compile("(%s|%s|%s|%s)" % (re3, re2, re1, re4)).search(sentence), I get:

foo2 is a word, not bar2
foo2
bar2

As far as I can understand, the first rule is executed but not the others. Can someone please point me in the right direction?

Kind regards,

Upvotes: 3

Views: 380

Answers (2)

kantal

Reputation: 2407

Giacomo answered the question. However, I also suggest: 1) put the "compile" before the loop, 2) gather the groups that actually matched (those that are not None) in a list, 3) think about using (.+) instead of (.*) in re1, re2, etc.

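    # Compile once, outside the loop; re1..re4 are the patterns from the question.
    rex = re.compile("%s|%s|%s|%s" % (re1, re2, re3, re4))
    for sentence in sentences:
        match = rex.search(sentence)
        if match:
            # Keep only the groups of the branch that actually matched.
            groups = [g for g in match.groups() if g is not None]
            print(groups[0], groups[1])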
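
To illustrate point 3, here is a small sketch of how (.*) and (.+) differ on a degenerate input. The input string is made up for the example.

```python
import re

# Hypothetical input: nothing on either side of the keyword phrase.
degenerate = " is not "

# (.*) may match the empty string, so both groups come back empty.
star = re.search(r"(.*) is not (.*)", degenerate)
print(star.groups())  # ('', '')

# (.+) needs at least one character on each side, so there is no match.
plus = re.search(r"(.+) is not (.+)", degenerate)
print(plus)  # None
```

Whether empty captures are acceptable depends on your data, which is why this is only something to think about.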

Upvotes: 1

Giacomo Alzetta

Reputation: 2479

There are various issues with your example:

  1. You wrap the alternation in a capturing group, so that outer group takes index 1, which you expected to refer to the first group of the inner regexes. Use a non-capturing group (?:%s|%s|%s|%s) instead.
  2. Group indexes keep increasing across the branches of |. So with (?:(a)|(b)|(c)) you'd get:

    >>> re.match(r'(?:(a)|(b)|(c))', 'a').groups()
    ('a', None, None)
    >>> re.match(r'(?:(a)|(b)|(c))', 'b').groups()
    (None, 'b', None)
    >>> re.match(r'(?:(a)|(b)|(c))', 'c').groups()
    (None, None, 'c')
    

    It seems like you'd expect to have a single group 1 that returns either a, b or c depending on the branch. That is not the case: indexes are assigned to the opening parentheses in order from left to right, without taking into account which branch of the regex they belong to.
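
Applied to the patterns from the question, the numbering described above can be sketched as follows. Each pattern contributes two groups, so the third pattern owns groups 5 and 6. The variable names are only illustrative.

```python
import re

patterns = [
    r"(.*) is not (.*)",
    r"(.*) is the same size as (.*)",
    r"(.*) is a word, not (.*)",
    r"(.*) is world know, not (.*)",
]

# Non-capturing outer group: only the inner groups are numbered, 1 through 8.
combined = re.compile("(?:%s)" % "|".join(patterns))

match = combined.search("foo2 is a word, not bar2")
print(match.groups())
# (None, None, None, None, 'foo2', 'bar2', None, None)
```

This is why match.group(1) and match.group(2) are only populated when the first pattern in the alternation is the one that matches.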

The third-party regex module does what you want if you use named groups: it allows the same group name in different branches, and those groups then share a single index. If you want to use the built-in re module, you'll have to live with the fact that group numbers differ between the branches of the regex.

>>> import regex
>>> regex.match(r'(?:(?P<x>a)|(?P<x>b)|(?P<x>c))', 'a').groups()
('a',)
>>> regex.match(r'(?:(?P<x>a)|(?P<x>b)|(?P<x>c))', 'b').groups()
('b',)
>>> regex.match(r'(?:(?P<x>a)|(?P<x>b)|(?P<x>c))', 'c').groups()
('c',)

(Trying to compile that pattern with re raises an error, because the group name is defined more than once.)
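
If you'd rather stay with the built-in re module, one possible workaround (not part of the answer above, just a sketch) is to compile each pattern separately and stop at the first one that matches, so that groups 1 and 2 always belong to the matching rule. The helper name first_match_groups is made up for this example.

```python
import re

# The four patterns from the question, compiled once up front.
PATTERNS = [
    re.compile(p)
    for p in (
        r"(.*) is not (.*)",
        r"(.*) is the same size as (.*)",
        r"(.*) is a word, not (.*)",
        r"(.*) is world know, not (.*)",
    )
]


def first_match_groups(sentence):
    """Return the groups of the first pattern that matches, or None."""
    for pattern in PATTERNS:
        match = pattern.search(sentence)
        if match is not None:
            return match.groups()
    return None


print(first_match_groups("foo2 is a word, not bar2"))  # ('foo2', 'bar2')
print(first_match_groups("nothing to see"))            # None
```

With at most 60 expressions, looping over them like this should be cheap, and it keeps the order of the list as the priority order of the rules.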

Upvotes: 2
