Reputation: 361
I am presently writing a Python script to process some 10,000 or so input documents. Based on the script's progress output, I notice that the first 400+ documents get processed really fast, but then the script slows down, even though the input documents are all approximately the same size.
I am assuming this may have to do with the fact that most of the document processing is done with regexes that I do not save as regex objects once they have been compiled. Instead, I recompile the regexes whenever I need them.
Since my script has about 10 different functions, each of which uses about 10-20 different regex patterns, I am wondering what would be a more efficient way in Python to avoid re-compiling the regex patterns over and over again (in Perl I could simply use the //o modifier).
My assumption is that if I store the regex objects in the individual functions using
pattern = re.compile()
the resulting regex object will not be retained until the next invocation of the function for the next document (each function is called only once per document).
Creating a global list of pre-compiled regexes seems an unattractive option since I would need to store the list of regexes in a different location in my code than where they are actually used.
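To illustrate, my functions currently look roughly like this (the function and pattern names are just placeholders):
import re

def extract_dates(document):
    # this pattern gets recompiled on every call, i.e. once per document
    date_re = re.compile(r'\d{4}-\d{2}-\d{2}')
    return date_re.findall(document)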
Any advice here on how to handle this neatly and efficiently?
Upvotes: 7
Views: 944
Reputation: 215049
In the spirit of "simple is better" I'd use a little helper function like this:
import re

def rc(pattern, flags=0):
    """Compile a pattern once and reuse the cached object on later calls."""
    key = pattern, flags
    if key not in rc.cache:
        rc.cache[key] = re.compile(pattern, flags)
    return rc.cache[key]

rc.cache = {}
Usage:
rc('[a-z]').sub...
rc('[a-z]').findall   # <- no compilation here
I also recommend trying the third-party regex module. Among many other advantages over the stock re, its MAXCACHE is 500 by default and the cache won't get dropped completely on overflow.
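If you want to try it, regex is designed as a drop-in replacement for re, so something like the following should work (assuming you have installed it, e.g. with pip install regex):
import regex as re   # third-party module, drop-in replacement for the stock re

# the rest of the code can stay unchanged; regex keeps a larger cache
# (500 entries by default) and doesn't drop it completely on overflow
print(re.findall(r'[a-z]+', 'Hello World'))   # ['ello', 'orld']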
Upvotes: 1
Reputation: 880927
The re module caches compiled regex patterns. The cache is cleared when it reaches a size of re._MAXCACHE, which is 100 by default. Since you have 10 functions with 10-20 regexes each (i.e. 100-200 regexes), your observed slow-down is consistent with the cache being cleared repeatedly.
If you are okay with changing private variables, a quick and dirty fix to your program might be to set re._MAXCACHE to a higher value:
import re
re._MAXCACHE = 1000
Upvotes: 10
Reputation: 83032
Last time I looked, re.compile maintained a rather small cache, and when it filled up, it just emptied it. DIY with no limit:
import re

class MyRECache(object):
    def __init__(self):
        self.cache = {}

    def compile(self, regex_string):
        # compile each pattern only once and reuse the cached object
        if regex_string not in self.cache:
            self.cache[regex_string] = re.compile(regex_string)
        return self.cache[regex_string]
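Usage might look like this (a minimal sketch; the pattern is just an example):
recache = MyRECache()

def count_words(text):
    # the pattern is compiled on the first call and reused afterwards
    return len(recache.compile(r"[a-z]+").findall(text))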
Upvotes: 5
Reputation: 363818
Compiled regular expressions are automatically cached by re.compile, re.search, and re.match, but the maximum cache size is 100 in Python 2.7, so you're overflowing the cache.
Creating a global list of pre-compiled regexes seems an unattractive option since I would need to store the list of regexes in a different location in my code than where they are actually used.
You can define them near the place where they are used: just before the functions that use them. If you reuse the same RE in a different place, it is a good idea to define it globally anyway, so you only have to modify it in one place.
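For example, a minimal sketch (the pattern and function names are illustrative, not from your script):
import re

WORD_RE = re.compile(r"[a-z]+")   # compiled once, at import time

def tokenize(document):
    # the precompiled pattern lives right above the function that uses it
    return WORD_RE.findall(document)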
Upvotes: 2