Reputation: 552
I am currently using Python's re module to search and capture groups.
I have a list of regular expressions which I have to compile and match against a large dataset, which causes performance issues.
Example:
REGEXES = [
    r'^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$',
    r'^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$',
    r'(?P<year>\d{4}-\d{1,2}-\d{1,2})$',
    r'^(?P<year>\d{1,2}/\d{1,2}/\d{2,4})$',
    r'^(?P<title>.+?)[- ]+E(?P<epi>\d+)$',
    # ... more patterns
]
Note: the regexes won't be similar to one another.
COMPILED_REGEXES = [re.compile(r, flags=re.I) for r in REGEXES]
def find_match(string):
    for regex in COMPILED_REGEXES:
        match = regex.search(string)
        if not match:
            continue
        return match
Is there a way around this? The idea is to avoid iterating through the compiled regexes to get a match.
Upvotes: 10
Views: 4448
Reputation: 1911
Do any of your regexps break DFA compatibility? It doesn't look like it in your examples. You can use a Python wrapper around a C/C++ DFA implementation like re2, which is a drop-in replacement for re. re2 will also fall back to using re if a regex is incompatible with re2's syntax, so it optimizes every case it can and doesn't fail on the incompatible ones.
Note that re2 does support the (?P<name>regex) capture syntax, but it doesn't support the (?P=name) backreference syntax.
try:
    import re2 as re
    re.set_fallback_notification(re.FALLBACK_WARNING)
except ImportError:
    # latest version was for Python 2.6; fall back to the stdlib engine
    import re
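Imported this way, the rest of your code can stay unchanged. A minimal illustrative sketch with one of the question's patterns (the sample string is made up):
m = re.search(r'^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$', 'Ohio 1/2/2019: snow')
if m:
    print(m.group('grp1'), m.group('grp2'))  # -> 1/2/2019 snow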
If you have regexps with backreferences, you can still use re2 with a few special considerations: you'll need to replace the backreferences in your regexps with .*?, and you may find false matches, which you can filter out with re. In real-world data, the false matches will probably be uncommon.
Here is an illustrative example:
import re
try:
    import re2
    re2.set_fallback_notification(re2.FALLBACK_WARNING)
except ImportError:
    # latest version was for Python 2.6; the code below requires re2
    raise
REGEXES = [
    r'^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$',
    r'^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$',
    r'(?P<year>\d{4}-\d{1,2}-\d{1,2})$',
    r'^(?P<year>\d{1,2}/\d{1,2}/\d{2,4})$',
    r'^(?P<title>.+?)[- ]+E(?P<epi>\d+)$',
]

COMPILED_REGEXES = [re.compile(r, flags=re.I) for r in REGEXES]

# replace all backrefs (\1, \g<1>, \g<name>) with .*? for re2 compatibility
# is there other unsupported syntax in REGEXES?
COMPILED_REGEXES_DFA = [re2.compile(re2.sub(r'\\\d+|\\g<\d+>|\\g<\w+>', '.*?', r), flags=re2.I)
                        for r in REGEXES]
def find_match(string):
    for regex, regex_dfa in zip(COMPILED_REGEXES, COMPILED_REGEXES_DFA):
        match_dfa = regex_dfa.search(string)
        if not match_dfa:
            continue
        match = regex.search(string)
        # most likely branch comes first for better branch prediction
        if match:
            return match
If this isn't fast enough, you can employ a variety of techniques to feed the DFA hits to re as they are produced, instead of storing them in a file or in memory and handing them off once they're all collected.
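For example, a generator can confirm each DFA hit with re and hand the match to the caller right away; a minimal sketch (the function name is hypothetical), reusing the COMPILED_REGEXES and COMPILED_REGEXES_DFA pairs defined above:
def find_matches_streaming(lines):
    # yield each confirmed match as soon as its DFA prefilter hits,
    # instead of collecting all hits first
    for line in lines:
        for regex, regex_dfa in zip(COMPILED_REGEXES, COMPILED_REGEXES_DFA):
            if regex_dfa.search(line):
                match = regex.search(line)
                if match:
                    yield match
                    break  # first matching regex wins, as in find_match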
You can also combine all your regexps into one big DFA regexp of alternated groups (r1)|(r2)|(r3)| ... |(rN) and iterate through the group matches on the resulting match object, trying only the corresponding original regexps. The match result object will have the same state as with the OP's original solution.
# rename group names in regexps to avoid name collisions
REGEXES_PREFIXED = [re2.sub(r'\(\?P<(\w+)>', r'(?P<re{}_\1>'.format(idx), r)
                    for idx, r in enumerate(REGEXES)]

# wrap and fold regexps into (?P<hit0>pattern)| ... |(?P<hitN>pattern)
REGEX_BIG = '|'.join('(?P<hit{}>{})'.format(idx, r)
                     for idx, r in enumerate(REGEXES_PREFIXED))

regex_dfa_big = re2.compile(REGEX_BIG, flags=re2.I)
def find_match(string):
    match_dfa = regex_dfa_big.search(string)
    if match_dfa:
        # only interested in the hit# groups that actually matched
        hits = [n for n, v in match_dfa.groupdict().items()
                if v is not None and re2.match(r'hit\d+', n)]
        # check for false positives
        for idx in [int(h.replace('hit', '')) for h in hits]:
            match = COMPILED_REGEXES[idx].search(string)
            if match:
                return match
You can also look at pyre, which is a better-maintained wrapper for the same C++ library, but not a drop-in replacement for re. There's also a Python wrapper for RuRe, which is the fastest regex engine I know of.
Upvotes: 5
Reputation: 1060
I would look at re.Scanner. It is undocumented and flagged as experimental, but it is a good example of using sre_parse and sre_compile to build a regex by parsing, merging, then compiling (note these are undocumented internals that have changed across Python versions). If you don't care about group names and only want to capture groups, this should work. Mind you, this code has no error checking.
import re
import sre_parse
import sre_compile

def compile_multiple(subpatterns, flags=0):
    """
    Return a compiled regex from an iterable collection of
    pattern strings so that it matches any of the patterns
    in the collection.
    """
    from sre_constants import BRANCH, SUBPATTERN
    if isinstance(flags, re.RegexFlag):
        flags = flags.value
    pattern = sre_parse.Pattern()
    pattern.flags = flags
    parsed_subpatterns = []
    for subpattern in subpatterns:
        gid = pattern.opengroup()
        parsed_subpattern = sre_parse.parse(subpattern, flags)
        # wrap each parsed pattern in its own numbered capture group
        parsed_subpatterns.append(sre_parse.SubPattern(pattern, [
            (SUBPATTERN, (gid, 0, 0, parsed_subpattern)),
        ]))
        pattern.closegroup(gid, parsed_subpatterns[-1])
    # branch over the alternatives: (p1)|(p2)|...|(pN)
    combined_pattern = sre_parse.SubPattern(pattern, [(BRANCH, (None, parsed_subpatterns))])
    return sre_compile.compile(combined_pattern)
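A usage sketch under the stated assumptions (patterns without capture groups of their own, so the wrapper groups added by compile_multiple are the only ones; the sample patterns and string are made up, and the sre_* internals used above have been renamed in newer Python versions):
combined = compile_multiple([r'\d{4}-\d{1,2}-\d{1,2}', r'E\d+$'], flags=re.I)
m = combined.search('aired 2017-11-03')
if m:
    print(m.lastindex)  # 1, i.e. the first pattern is the one that matched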
Upvotes: 0
Reputation: 4100
I suggest the following steps:
Create a CSV file called Patterns.csv with two columns, pattern and name, where pattern is a regex pattern like ^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$ and name can be New York. This keeps all the regexes in a resource separate from your code, which will help in the future if you want to add, remove, or modify regexes. For example:
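An illustrative Patterns.csv (the name values are made up; the patterns are quoted because quantifiers like {1,2} contain commas):
pattern,name
"^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$",New York
"^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$",Ohio
"^(?P<year>\d{1,2}/\d{1,2}/\d{2,4})$",US date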
Read that CSV with pandas:
import pandas as pd

df = pd.read_csv("Patterns.csv")
Parse the CSV into a dict mapping each name to its pattern:
pattern = df['pattern'].tolist()
pattern_name = df['name'].tolist()
pattern_dict = dict(zip(pattern_name, pattern))
Run each pattern against your text and collect every matching value:
import collections
import re

sep = " ;; "
NLU_Dict = collections.defaultdict()
for pn, p in pattern_dict.items():
    val = sep.join(sep.join(filter(lambda x: len(str(x).strip()) > 0, map(str, v)))
                   for v in re.findall(p, text, re.I))
    NLU_Dict[pn] = val
Your NLU_Dict will be a dict keyed by pattern name: each matched pattern's entry holds its ;;-separated values, and unmatched patterns get a blank string.
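For instance, with the illustrative CSV above and a made-up input, the loop would produce (hypothetical values):
text = 'New York12/34: hello'
# ... after running the loop above:
# NLU_Dict == {'New York': '12/34 ;; hello', 'Ohio': '', 'US date': ''}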
Upvotes: 2
Reputation: 1180
To elaborate on my comment: the problem with putting it all in one big regexp is that group names must be unique. However, you could process your regexes as follows:
import re

REGEXES = [
    r'^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$',
    r'^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$',
    r'(?P<year>\d{4}-\d{1,2}-\d{1,2})$',
    r'^(?P<year>\d{1,2}/\d{1,2}/\d{2,4})$',
    r'^(?P<title>.+?)[- ]+E(?P<epi>\d+)$']

# Find the names of the groups in each regexp
groupnames = {'RE_%s' % i: re.findall(r'\(\?P<([^>]+)>', r) for i, r in enumerate(REGEXES)}

# Convert the named groups into unnamed ones
re_list_cleaned = [re.sub(r'\?P<([^>]+)>', '', r) for r in REGEXES]

# Wrap each regexp in a named group
token_re_list = ['(?P<RE_%s>%s)' % (i, r) for i, r in enumerate(re_list_cleaned)]

# Put them all together
mighty_re = re.compile('|'.join(token_re_list), re.MULTILINE)

# Use the regexp to process a big file
with open('bigfile.txt') as f:
    txt = f.read()
    for match in mighty_re.finditer(txt):
        # Find out which regexp made the match and put the matched data in a dictionary
        re_name = match.lastgroup
        groups = [g for g in match.groups() if g is not None]
        gn = groupnames[re_name]
        matchdict = dict(zip(gn, groups[1:]))
        print('Found:', re_name, matchdict)
Upvotes: 5
Reputation: 106995
If all of your regex patterns follow the same format of a city name (not captured), followed by a captured series of /-delimited numbers, a colon and a space, and then the captured rest of the string, you can simply parse them all with the same regex pattern:
def find_match(string):
    return re.search(r'(?P<grp1>\d+(?:/\d+)*): (?P<grp2>.+)', string)
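An illustrative call (assuming re is imported; the sample string is made up):
m = find_match('New York12/34: hello')
print(m.groupdict())  # -> {'grp1': '12/34', 'grp2': 'hello'}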
Upvotes: -1