Reputation: 667
For clarity, I was looking for a way to compile multiple regexes at once.
For simplicity, let's say that every expression has the format (.*) something (.*).
There will be no more than 60 expressions to be tested.
As seen here, I finally wrote the following.
import re
re1 = r'(.*) is not (.*)'
re2 = r'(.*) is the same size as (.*)'
re3 = r'(.*) is a word, not (.*)'
re4 = r'(.*) is world know, not (.*)'
sentences = ["foo2 is a word, not bar2"]
for sentence in sentences:
    match = re.compile("(%s|%s|%s|%s)" % (re1, re2, re3, re4)).search(sentence)
    if match is not None:
        print(match.group(1))
        print(match.group(2))
        print(match.group(3))
Since the regexes are separated by a pipe, I thought matching would stop automatically as soon as one rule matched.
Executing the code, I get:
foo2 is a word, not bar2
None
None
But after swapping re3 and re1 in re.compile, i.e. match = re.compile("(%s|%s|%s|%s)" % (re3, re2, re1, re4)).search(sentence), I get:
foo2 is a word, not bar2
foo2
bar2
As far as I can tell, the first rule is executed but not the others. Can someone please point me in the right direction on this?
Kind regards,
Upvotes: 3
Views: 380
Reputation: 2407
Giacomo answered the question. However, I also suggest: 1) putting the compile before the loop, 2) gathering the non-empty groups in a list, 3) considering (.+) instead of (.*) in re1, re2, etc.
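# Compile once, outside the loop.
rex = re.compile("%s|%s|%s|%s" % (re1, re2, re3, re4))
for sentence in sentences:
    match = rex.search(sentence)
    if match:
        # Keep only the groups of the branch that actually matched.
        groups = [g for g in match.groups() if g is not None]
        print(groups[0], groups[1])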
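With up to 60 expressions, formatting the alternation by hand gets unwieldy. A minimal sketch of the same idea, assuming the patterns are kept in a list (the names `PATTERNS`, `COMBINED` and `extract_pair` are made up for illustration), builds the alternation with `"|".join`:

```python
import re

# Hypothetical list standing in for the (up to 60) expressions.
PATTERNS = [
    r'(.+) is not (.+)',
    r'(.+) is the same size as (.+)',
    r'(.+) is a word, not (.+)',
    r'(.+) is world know, not (.+)',
]

# Compile once; no outer capturing group is added around the alternation.
COMBINED = re.compile("|".join(PATTERNS))


def extract_pair(sentence):
    """Return the captured parts of the branch that matched, or None."""
    match = COMBINED.search(sentence)
    if match is None:
        return None
    # Groups belonging to branches that did not match are None.
    return tuple(g for g in match.groups() if g is not None)


print(extract_pair("foo2 is a word, not bar2"))
```

Note that the filtering relies on (.+): with (.*) a group can legitimately capture an empty string, which is still not None, so the filter itself keeps working, but the pair may contain empty parts.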
Upvotes: 1
Reputation: 2479
There are various issues with your example:
1. The outer parentheses in "(%s|%s|%s|%s)" form a capturing group, so they take index 1, the index that you'd expect to reference the first group of the inner regexes. Use a non-capturing group (?:%s|%s|%s|%s) instead.
2. Group indexes keep increasing even across |. So with (?:(a)|(b)|(c)) you'd get:
>>> re.match(r'(?:(a)|(b)|(c))', 'a').groups()
('a', None, None)
>>> re.match(r'(?:(a)|(b)|(c))', 'b').groups()
(None, 'b', None)
>>> re.match(r'(?:(a)|(b)|(c))', 'c').groups()
(None, None, 'c')
It seems like you'd expect to have a single group 1 that returns either a, b or c depending on the branch... no, indexes are assigned in order from left to right without taking the grammar of the regex into account.
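Applied to the patterns from the question, the same numbering explains the output seen there. The sketch below uses only the built-in `re` module with a non-capturing outer group; using `lastindex` to locate the matching branch is just one possible approach and assumes every branch has exactly two groups:

```python
import re

re1 = r'(.*) is not (.*)'
re2 = r'(.*) is the same size as (.*)'
re3 = r'(.*) is a word, not (.*)'
re4 = r'(.*) is world know, not (.*)'

combined = re.compile("(?:%s|%s|%s|%s)" % (re1, re2, re3, re4))
match = combined.search("foo2 is a word, not bar2")

# Eight groups in total: two per branch, numbered left to right.
print(match.groups())

# re3 is the third branch, so its groups are numbers 5 and 6.
print(match.group(5), match.group(6))

# lastindex is the index of the last capturing group that matched.
# Assuming two groups per branch, the branch's first group is lastindex - 1.
first = match.lastindex - 1
print(match.group(first), match.group(first + 1))
```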
The third-party regex module does what you want with numbering the groups if you use named groups. If you want to use the built-in module, you'll have to live with the fact that numbering is not the same between different branches of the regex:
>>> import regex
>>> regex.match(r'(?:(?P<x>a)|(?P<x>b)|(?P<x>c))', 'a').groups()
('a',)
>>> regex.match(r'(?:(?P<x>a)|(?P<x>b)|(?P<x>c))', 'b').groups()
('b',)
>>> regex.match(r'(?:(?P<x>a)|(?P<x>b)|(?P<x>c))', 'c').groups()
('c',)
(Trying to use that regex with re will give an error for duplicated groups.)
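If installing regex is not an option, one way to get stable group numbers with the built-in module is to drop the alternation and try the compiled patterns one by one. This is only a sketch and not part of the answer above; `RULES` and `first_match` are made-up names:

```python
import re

# Each rule is compiled separately, once.
RULES = [re.compile(p) for p in (
    r'(.*) is not (.*)',
    r'(.*) is the same size as (.*)',
    r'(.*) is a word, not (.*)',
    r'(.*) is world know, not (.*)',
)]


def first_match(sentence):
    """Return (rule_number, left, right) for the first rule that matches, else None."""
    for number, rule in enumerate(RULES, start=1):
        match = rule.search(sentence)
        if match:
            # Each rule is its own pattern, so its groups are always 1 and 2.
            return number, match.group(1), match.group(2)
    return None


print(first_match("foo2 is a word, not bar2"))
```

The loop stops at the first rule that matches, which is the behaviour the question expected from the pipe, and it also tells you which rule fired. The trade-off is up to one search per rule instead of a single search.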
Upvotes: 2