sarbjit

Reputation: 3894

Finding which regex matched a string from a combined list of regexes

I have a list of regexes, each assigned a keyword for identification. I will be comparing a list of strings against this list of regexes. If any of the patterns matches, I want to identify which regex matched in an efficient way.

import re

regexes = {'tag1' : 'regex1', 'tag2' : 'regex2', 'tag3' : r'^[a-z]+\.com'}

Option 1:

for k,v in regexes.items():
    s = re.findall(v, "google.com")
    if len(s) != 0:
        print("Found match with tag : ", k)

Option 2:

combined_regex = re.compile('|'.join('(?:{0})'.format(x) for x in regexes.values()))
print(combined_regex.findall("google.com"))

Problem:

Option 2 only tells me whether any of the patterns matched. Is it also possible to tell which pattern matched when using the combined regex?

Upvotes: 0

Views: 94

Answers (2)

dawg

Reputation: 103844

I would do something along these lines:

import re

tgts = ['abc', 'def', 'ghi', 'xyz']
# Map each compiled pattern to its tag.
pats = {re.compile(p): t for p, t in
        [(r'(a)|(i)', 'tag 1'), (r'(f)|(g)', 'tag 2'), (r'(g)|(d)', 'tag 3')]}

for s in tgts:
    # Take the first pattern that matches; fall back to a sentinel tuple.
    tr = next(((s, pats[p], m.groups())
               for p in pats if (m := p.search(s))), ("No Match",))
    print(tr)

This prints the first matching pattern for each string (even if more than one would match):

('abc', 'tag 1', ('a', None))
('def', 'tag 2', ('f', None))
('ghi', 'tag 1', (None, 'i'))
('No Match',)

If you want a list of all the matching patterns:

for s in tgts:
    tr=[(s, pats[p], m.groups()) 
        for p in pats if (m:=p.search(s))]
    if tr:
        print(tr)
    else:
        print('No matches')

Prints:

[('abc', 'tag 1', ('a', None))]
[('def', 'tag 2', ('f', None)), ('def', 'tag 3', (None, 'd'))]
[('ghi', 'tag 1', (None, 'i')), ('ghi', 'tag 2', (None, 'g')), ('ghi', 'tag 3', ('g', None))]
No matches

Upvotes: 0

Tim Peters

Reputation: 70602

If you're concerned about efficiency, compile the regexps once-and-for-all, and don't use findall(). If you only care whether there's a match, then just use .search() - there's no need to build a list of all matches in that case.

I'd also invert the dict, mapping compiled regexp objects to tags instead:

import re
p2tag = {re.compile('regex1') : 'tag1',
         re.compile('regex2') : 'tag2',
         re.compile(r'^[a-z]+\.com') : 'tag3'}
for s in ['aregex1', 'bregex2k', 'blah.com123', 'hopeless']:
    for p in p2tag:
        if m := p.search(s):
            print(repr(s), "matched by", repr(p2tag[p]), m)
            break
    else:
        print("no match for", repr(s))

which displays:

'aregex1' matched by 'tag1' <re.Match object; span=(1, 7), match='regex1'>
'bregex2k' matched by 'tag2' <re.Match object; span=(1, 7), match='regex2'>
'blah.com123' matched by 'tag3' <re.Match object; span=(0, 8), match='blah.com'>
no match for 'hopeless'

EDIT: I'll add that there is a way to find which groups matched, and that can be abused to find which of your regexps matched when squashed into a single regexp. But you need to use capturing groups for this. Here I'll add "xxxx" as a temporary prefix for your tag names to build group names, but there's no protection against conflicts with named groups with the same names in the input regexps. Continuing from the above,

[Another edit: changed the regexp to be more reliable]

pieces = []
for (p, tag) in p2tag.items():
    pieces.append(f"(?:{p.pattern})(?P<xxxx{tag}>)")
fatre = "|".join(pieces)
print(fatre)
searcher = re.compile(fatre).search

for s in ['aregex1', 'bregex2k', 'blah.com123', 'hopeless']:
    if m := searcher(s):
        assert m.lastgroup.startswith("xxxx")
        print(repr(s), "matched by", repr(m.lastgroup[4:]))
    else:
        print("no match for", repr(s))

displays:

(?:regex1)(?P<xxxxtag1>)|(?:regex2)(?P<xxxxtag2>)|(?:^[a-z]+\.com)(?P<xxxxtag3>)
'aregex1' matched by 'tag1'
'bregex2k' matched by 'tag2'
'blah.com123' matched by 'tag3'
no match for 'hopeless'

This all builds on the .lastgroup attribute of a match object, which gives the name of the last group that matched.
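One limitation of building group names straight from the tags is that a tag has to be a valid group name. A minimal sketch of a variation, assuming generated marker names (the `_alt0`, `_alt1`, ... names and the `which_tag` helper are hypothetical, not part of the question) that are mapped back to the tags through a dict, so a tag such as 'tag 2' with a space still works:

```python
import re

# Sample tags; 'tag 2' contains a space, so it could not be used
# directly as a group name.
regexes = {'tag1': r'regex1', 'tag 2': r'regex2', 'tag3': r'^[a-z]+\.com'}

group_to_tag = {}
pieces = []
for i, (tag, pattern) in enumerate(regexes.items()):
    group = f"_alt{i}"  # generated name, assumed not to clash with the inputs
    group_to_tag[group] = tag
    # The empty named group only takes part in the match if this alternative did.
    pieces.append(f"(?:{pattern})(?P<{group}>)")

searcher = re.compile("|".join(pieces)).search

def which_tag(s):
    """Return the tag of the alternative that matched, or None."""
    m = searcher(s)
    return group_to_tag[m.lastgroup] if m else None

for s in ['aregex1', 'bregex2k', 'blah.com123', 'hopeless']:
    print(repr(s), "->", repr(which_tag(s)))
```

This still assumes the input regexps don't define groups named `_alt0` and so on, and don't rely on numbered backreferences, since group numbering changes once the patterns are combined.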

I don't much like it. But, I haven't timed it, and if it turned out to be much faster in a context where that mattered, I'd use it ;-)
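For anyone who does want to time it, here is a rough sketch using timeit on the sample data from this answer. The function names `loop_tags` and `combined_tags` are made up for the illustration, and the numbers will depend heavily on the actual patterns and inputs, so treat it as a starting point, not a benchmark:

```python
import re
import timeit

regexes = {'tag1': r'regex1', 'tag2': r'regex2', 'tag3': r'^[a-z]+\.com'}
strings = ['aregex1', 'bregex2k', 'blah.com123', 'hopeless']

# Approach 1: one compiled pattern per tag, tried in dict order.
p2tag = {re.compile(p): tag for tag, p in regexes.items()}

def loop_tags():
    return [next((tag for p, tag in p2tag.items() if p.search(s)), None)
            for s in strings]

# Approach 2: one combined pattern with an empty named marker group
# after each alternative.
fat_search = re.compile(
    "|".join(f"(?:{p})(?P<xxxx{tag}>)" for tag, p in regexes.items())
).search

def combined_tags():
    return [m.lastgroup[4:] if (m := fat_search(s)) else None
            for s in strings]

# Both approaches should agree on this sample before timing means anything.
assert loop_tags() == combined_tags()

for fn in (loop_tags, combined_tags):
    seconds = timeit.timeit(fn, number=20_000)
    print(f"{fn.__name__}: {seconds:.3f}s for 20,000 runs")
```

Note that the two approaches are not equivalent in general: the loop reports the first pattern in dict order that matches anywhere in the string, while the combined regexp reports whichever alternative matches at the leftmost position. For 'regex2 regex1', the loop gives 'tag1' and the combined regexp gives 'tag2'.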

Upvotes: 3
