Reputation: 552
I am currently using Python's re module to search and capture groups.
I have a list of regular expressions which I have to compile and match against a large dataset, which causes performance issues.
Example:
REGEXES = [
    r'^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$',
    r'^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$',
    r'(?P<year>\d{4}-\d{1,2}-\d{1,2})$',
    r'^(?P<year>\d{1,2}/\d{1,2}/\d{2,4})$',
    r'^(?P<title>.+?)[- ]+E(?P<epi>\d+)$',
    # ... more patterns
]
Note: the regexes won't be similar to one another.
COMPILED_REGEXES = [re.compile(r, flags=re.I) for r in REGEXES]
def find_match(string):
    for regex in COMPILED_REGEXES:
        match = regex.search(string)
        if not match:
            continue
        return match
Is there a way around this? The idea is to avoid iterating through the compiled regexes to get a match.
Upvotes: 10
Views: 4448
Reputation: 1911
Do any of your regexps break DFA compatibility? It doesn't look like it in your examples. You can use a Python wrapper around a C/C++ DFA implementation like re2, which is a drop-in replacement for re. re2 will also fall back to using re if a regex is incompatible with re2's syntax, so it optimizes every case it can and doesn't fail on the incompatible ones.
Note that re2 does support the (?P<name>regex) capture syntax, but it doesn't support the (?P=name) backreference syntax.
try:
    import re2 as re
    re.set_fallback_notification(re.FALLBACK_WARNING)
except ImportError:
    # latest version was for Python 2.6; fall back to the stdlib engine
    import re
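Imported this way, the rest of your code can stay unchanged. A minimal illustrative sketch with one of the question's patterns (the sample string is made up):
m = re.search(r'^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$', 'Ohio 1/2/2019: snow')
if m:
    print(m.group('grp1'), m.group('grp2'))  # -> 1/2/2019 snow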
If you have regexps with backreferences, you can still use re2 with a few special considerations: you'll need to replace the backreferences in your regexps with .*?, and you may find false matches, which you can filter out with re. In real-world data, the false matches will probably be uncommon.
Here is an illustrative example:
import re
try:
    import re2
    re2.set_fallback_notification(re2.FALLBACK_WARNING)
except ImportError:
    # latest version was for Python 2.6; the code below requires re2
    raise
REGEXES = [
    r'^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$',
    r'^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$',
    r'(?P<year>\d{4}-\d{1,2}-\d{1,2})$',
    r'^(?P<year>\d{1,2}/\d{1,2}/\d{2,4})$',
    r'^(?P<title>.+?)[- ]+E(?P<epi>\d+)$',
]

COMPILED_REGEXES = [re.compile(r, flags=re.I) for r in REGEXES]

# replace all backrefs (\1, \g<1>, \g<name>) with .*? for re2 compatibility
# is there other unsupported syntax in REGEXES?
COMPILED_REGEXES_DFA = [re2.compile(re2.sub(r'\\\d+|\\g<\d+>|\\g<\w+>', '.*?', r), flags=re2.I)
                        for r in REGEXES]
def find_match(string):
    for regex, regex_dfa in zip(COMPILED_REGEXES, COMPILED_REGEXES_DFA):
        match_dfa = regex_dfa.search(string)
        if not match_dfa:
            continue
        match = regex.search(string)
        # most likely branch comes first for better branch prediction
        if match:
            return match
If this isn't fast enough, you can employ a variety of techniques to feed the DFA hits to re as they are produced, instead of storing them in a file or in memory and handing them off once they're all collected.
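For example, a generator can confirm each DFA hit with re and hand the match to the caller right away; a minimal sketch (the function name is hypothetical), reusing the COMPILED_REGEXES and COMPILED_REGEXES_DFA pairs defined above:
def find_matches_streaming(lines):
    # yield each confirmed match as soon as its DFA prefilter hits,
    # instead of collecting all hits first
    for line in lines:
        for regex, regex_dfa in zip(COMPILED_REGEXES, COMPILED_REGEXES_DFA):
            if regex_dfa.search(line):
                match = regex.search(line)
                if match:
                    yield match
                    break  # first matching regex wins, as in find_match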
You can also combine all your regexps into one big DFA regexp of alternated groups (r1)|(r2)|(r3)| ... |(rN) and iterate through the group matches on the resulting match object, trying only the corresponding original regexps. The match result object will have the same state as with the OP's original solution.
# rename group names in regexps to avoid name collisions
REGEXES_PREFIXED = [re2.sub(r'\(\?P<(\w+)>', r'(?P<re{}_\1>'.format(idx), r)
                    for idx, r in enumerate(REGEXES)]

# wrap and fold regexps into (?P<hit0>pattern)| ... |(?P<hitN>pattern)
REGEX_BIG = '|'.join('(?P<hit{}>{})'.format(idx, r)
                     for idx, r in enumerate(REGEXES_PREFIXED))

regex_dfa_big = re2.compile(REGEX_BIG, flags=re2.I)
def find_match(string):
    match_dfa = regex_dfa_big.search(string)
    if match_dfa:
        # only interested in the hit# groups that actually matched
        hits = [n for n, v in match_dfa.groupdict().items()
                if v is not None and re2.match(r'hit\d+', n)]
        # check for false positives
        for idx in [int(h.replace('hit', '')) for h in hits]:
            match = COMPILED_REGEXES[idx].search(string)
            if match:
                return match
You can also look at pyre, which is a better-maintained wrapper for the same C++ library, but not a drop-in replacement for re. There's also a Python wrapper for RuRe, which is the fastest regex engine I know of.
Upvotes: 5
Reputation: 1060
I would look at re.Scanner. It is undocumented and flagged as experimental, but it is a good example of using sre_parse and sre_compile to build a regex by parsing, merging, then compiling (note these are undocumented internals that have changed across Python versions). If you don't care about group names and only want to capture groups, this should work. Mind you, this code has no error checking.
import re
import sre_parse
import sre_compile

def compile_multiple(subpatterns, flags=0):
    """
    Return a compiled regex from an iterable collection of
    pattern strings so that it matches any of the patterns
    in the collection.
    """
    from sre_constants import BRANCH, SUBPATTERN
    if isinstance(flags, re.RegexFlag):
        flags = flags.value
    pattern = sre_parse.Pattern()
    pattern.flags = flags
    parsed_subpatterns = []
    for subpattern in subpatterns:
        gid = pattern.opengroup()
        parsed_subpattern = sre_parse.parse(subpattern, flags)
        # wrap each parsed pattern in its own numbered capture group
        parsed_subpatterns.append(sre_parse.SubPattern(pattern, [
            (SUBPATTERN, (gid, 0, 0, parsed_subpattern)),
        ]))
        pattern.closegroup(gid, parsed_subpatterns[-1])
    # branch over the alternatives: (p1)|(p2)|...|(pN)
    combined_pattern = sre_parse.SubPattern(pattern, [(BRANCH, (None, parsed_subpatterns))])
    return sre_compile.compile(combined_pattern)
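A usage sketch under the stated assumptions (patterns without capture groups of their own, so the wrapper groups added by compile_multiple are the only ones; the sample patterns and string are made up, and the sre_* internals used above have been renamed in newer Python versions):
combined = compile_multiple([r'\d{4}-\d{1,2}-\d{1,2}', r'E\d+$'], flags=re.I)
m = combined.search('aired 2017-11-03')
if m:
    print(m.lastindex)  # 1, i.e. the first pattern is the one that matched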
Upvotes: 0
Reputation: 4100
I suggest the following steps:
Create a CSV file called Patterns.csv with two columns, pattern and name, where pattern is a regex pattern like ^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$ and name can be New York. This keeps all the regexes in a resource separate from your code, which will help in the future if you want to add, remove, or modify regexes. For example:
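An illustrative Patterns.csv (the name values are made up; the patterns are quoted because quantifiers like {1,2} contain commas):
pattern,name
"^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$",New York
"^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$",Ohio
"^(?P<year>\d{1,2}/\d{1,2}/\d{2,4})$",US date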
Read that CSV with pandas:
import pandas as pd

df = pd.read_csv("Patterns.csv")
Parse the CSV into a dict mapping each name to its pattern:
pattern = df['pattern'].tolist()
pattern_name = df['name'].tolist()
pattern_dict = dict(zip(pattern_name, pattern))
Run each pattern against your text and collect every matching value:
import collections
import re

sep = " ;; "
NLU_Dict = collections.defaultdict()
for pn, p in pattern_dict.items():
    val = sep.join(sep.join(filter(lambda x: len(str(x).strip()) > 0, map(str, v)))
                   for v in re.findall(p, text, re.I))
    NLU_Dict[pn] = val
Your NLU_Dict will be a dict keyed by pattern name: each matched pattern's entry holds its ;;-separated values, and unmatched patterns get a blank string.
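For instance, with the illustrative CSV above and a made-up input, the loop would produce (hypothetical values):
text = 'New York12/34: hello'
# ... after running the loop above:
# NLU_Dict == {'New York': '12/34 ;; hello', 'Ohio': '', 'US date': ''}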
Upvotes: 2
Reputation: 1180
To elaborate on my comment: the problem with putting it all in one big regexp is that group names must be unique. However, you could process your regexes as follows:
import re

REGEXES = [
    r'^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$',
    r'^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$',
    r'(?P<year>\d{4}-\d{1,2}-\d{1,2})$',
    r'^(?P<year>\d{1,2}/\d{1,2}/\d{2,4})$',
    r'^(?P<title>.+?)[- ]+E(?P<epi>\d+)$']

# Find the names of the groups in each regexp
groupnames = {'RE_%s' % i: re.findall(r'\(\?P<([^>]+)>', r) for i, r in enumerate(REGEXES)}

# Convert the named groups into unnamed ones
re_list_cleaned = [re.sub(r'\?P<([^>]+)>', '', r) for r in REGEXES]

# Wrap each regexp in a named group
token_re_list = ['(?P<RE_%s>%s)' % (i, r) for i, r in enumerate(re_list_cleaned)]

# Put them all together
mighty_re = re.compile('|'.join(token_re_list), re.MULTILINE)

# Use the regexp to process a big file
with open('bigfile.txt') as f:
    txt = f.read()
    for match in mighty_re.finditer(txt):
        # Find out which regexp made the match and put the matched data in a dictionary
        re_name = match.lastgroup
        groups = [g for g in match.groups() if g is not None]
        gn = groupnames[re_name]
        matchdict = dict(zip(gn, groups[1:]))
        print('Found:', re_name, matchdict)
Upvotes: 5
Reputation: 106995
If all of your regex patterns follow the same format of a city name (not captured), followed by a captured series of /-delimited numbers, a colon and a space, and then the captured rest of the string, you can simply parse them all with the same regex pattern:
def find_match(string):
    return re.search(r'(?P<grp1>\d+(?:/\d+)*): (?P<grp2>.+)', string)
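An illustrative call (assuming re is imported; the sample string is made up):
m = find_match('New York12/34: hello')
print(m.groupdict())  # -> {'grp1': '12/34', 'grp2': 'hello'}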
Upvotes: -1