Pat

Reputation: 361

Pythonic and efficient way of defining multiple regexes for use over many iterations

I am presently writing a Python script to process some 10,000 or so input documents. Based on the script's progress output, I notice that the first 400+ documents get processed really fast, and then the script slows down, although the input documents are all approximately the same size.

I am assuming this may have to do with the fact that most of the document processing is done with regexes that I do not save as regex objects once they have been compiled. Instead, I recompile the regexes whenever I need them.

Since my script has about 10 different functions, each of which uses about 10-20 different regex patterns, I am wondering what would be a more efficient way in Python to avoid re-compiling the regex patterns over and over again (in Perl I could simply use the /o modifier).

My assumption is that if I store the regex objects in the individual functions using

pattern = re.compile()

the resulting regex object will not be retained between invocations of the function, i.e. it will be recompiled for each document (each function is called only once per document).
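
For illustration, this is the kind of per-call compilation I mean (the function name and pattern here are just placeholders):

import re

def extract_dates(text):
    # Compiled anew on every call; this is exactly what I'd like to avoid.
    pattern = re.compile(r'\d{4}-\d{2}-\d{2}')
    return pattern.findall(text)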

Creating a global list of pre-compiled regexes seems an unattractive option since I would need to store the list of regexes in a different location in my code than where they are actually used.

Any advice here on how to handle this neatly and efficiently?

Upvotes: 7

Views: 944

Answers (4)

georg

Reputation: 215049

In the spirit of "simple is better" I'd use a little helper function like this:

import re

def rc(pattern, flags=0):
    # Compile the pattern only the first time a given (pattern, flags)
    # pair is seen; afterwards return the cached regex object.
    key = pattern, flags
    if key not in rc.cache:
        rc.cache[key] = re.compile(pattern, flags)
    return rc.cache[key]

rc.cache = {}

Usage:

rc('[a-z]').sub(...)
rc('[a-z]').findall(...)    # no compilation here; the cached pattern is reused

I also recommend trying the third-party regex module. Among its many other advantages over the stock re, its MAXCACHE is 500 by default and the cache is not dropped completely on overflow.
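
A minimal sketch of swapping it in (assuming the package has been installed, e.g. via pip install regex):

import regex  # third-party module, largely a drop-in replacement for re

# The module-level functions cache compiled patterns just like re does,
# but with a larger and more forgiving cache.
regex.findall(r'[a-z]+', 'some sample text')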

Upvotes: 1

unutbu

Reputation: 880927

The re module caches compiled regex patterns. The cache is cleared when it reaches a size of re._MAXCACHE, which is 100 by default. Since you have 10 functions with 10-20 regexes each (i.e. 100-200 regexes in total), your observed slow-down is consistent with the cache being cleared.

If you are okay with changing private variables, a quick and dirty fix to your program might be to set re._MAXCACHE to a higher value:

import re
re._MAXCACHE = 1000

Upvotes: 10

John Machin

Reputation: 83032

Last time I looked, re.compile maintained a rather small cache, and when it filled up, just emptied it. DIY with no limit:

import re

class MyRECache(object):
    def __init__(self):
        # Maps pattern strings to their compiled regex objects.
        self.cache = {}
    def compile(self, regex_string):
        # Compile each pattern at most once; later calls return the cached object.
        if regex_string not in self.cache:
            self.cache[regex_string] = re.compile(regex_string)
        return self.cache[regex_string]
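
A short usage sketch (the pattern here is just an example):

recache = MyRECache()

word_re = recache.compile(r'\w+')    # compiled and cached on first use
word_re = recache.compile(r'\w+')    # cache hit; the same object is returned
print(word_re.findall("hello world"))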

Upvotes: 5

Fred Foo

Reputation: 363818

Compiled regular expressions are automatically cached by re.compile, re.search and re.match, but the maximum cache size is 100 in Python 2.7, so you're overflowing the cache.

Creating a global list of pre-compiled regexes seems an unattractive option since I would need to store the list of regexes in a different location in my code than where they are actually used.

You can define them near the place where they are used: just before the functions that use them. If you reuse the same RE in a different place, it is a good idea to define it globally anyway, to avoid having to modify it in multiple places.
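
For example, a pattern defined at module level right above the function that uses it (the names here are hypothetical) is compiled exactly once, when the module is imported:

import re

# Compiled once at import time, defined right next to the function that uses it.
HEADING_RE = re.compile(r'^#+\s*(.+)$', re.MULTILINE)

def extract_headings(document):
    return HEADING_RE.findall(document)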

Upvotes: 2
