polwel
polwel

Reputation: 647

Why is a compiled python regex slower?

In another SO question, the performance of regexes and Python's in operator were compared. However, the accepted answer uses re.match, which only matches the beginning of a string, and thus behaves completely different to in. Also, I wanted to see the performance gain of not recompiling the RE each time.

Surprisingly, I see that the pre-compiled version seems to be slower.

Any ideas why?

I am aware that there are quite a few other questions here that wonder about a similar issue. Most of them perform the way they do simply because they do not correctly reuse the compiled regex. If that is also my issue, please explain.

from timeit import timeit
import re

pattern = 'sed'
text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod' \
       'tempor incididunt ut labore et dolore magna aliqua.'

compiled_pattern = re.compile(pattern)

def find():
    assert text.find(pattern) > -1

def re_search():
    assert re.search(pattern, text)

def re_compiled():
    assert re.search(compiled_pattern, text)

def in_find():
    assert pattern in text

print('str.find     ', timeit(find))
print('re.search    ', timeit(re_search))
print('re (compiled)', timeit(re_compiled))
print('in           ', timeit(in_find))

Output:

str.find      0.36285957560356435
re.search     1.047689160564772
re (compiled) 1.575113873320307
in            0.1907925627077569

Upvotes: 11

Views: 3006

Answers (1)

Eric Duminil
Eric Duminil

Reputation: 54233

Short answer

If you call compiled_pattern.search(text) directly, it won't call _compile at all, it will be faster than re.search(pattern, text) and much faster than re.search(compiled_pattern, text).

This performance difference is due to KeyErrors in cache and slow hash calculations for compiled patterns.


re functions and SRE_Pattern methods

Any time a re function with a pattern as 1st argument (e.g. re.search(pattern, string) or re.findall(pattern, string)) is called, Python tries to compile the pattern first with _compile and then calls the corresponding method on the compiled pattern. For example:

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

Note that pattern could be either a string or an already compiled pattern (an SRE_Pattern instance).

_compile

Here's a compact version of _compile. I simply removed debug and flags check:

_cache = {}
_pattern_type = type(sre_compile.compile("", 0))
_MAXCACHE = 512

def _compile(pattern, flags):
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    loc = None
    _cache[type(pattern), pattern, flags] = p, loc
    return p

_compile with String pattern

When _compile is called with a string pattern, the compiled pattern is saved in _cache dict. Next time the same function is called (e.g. during the many timeit runs), _compile simply checks in _cache if this string has already been seen and returns the corresponding compiled pattern.

Using ipdb debugger in Spyder, it's easy to dive into re.py during execution.

import re

pattern = 'sed'
text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod' \
       'tempor incididunt ut labore et dolore magna aliqua.'

compiled_pattern = re.compile(pattern)

re.search(pattern, text)
re.search(pattern, text)

With a breakpoint at the second re.search(pattern, text), it can be seen that the pair:

{(<class 'str'>, 'sed', 0): (re.compile('sed'), None)}

is saved in _cache. The compiled pattern is returned directly.

_compile with compiled pattern

slow hash

What happens if _compile is called with an already compiled pattern?

First, _compile checks if the pattern is in _cache. To do so, it needs to calculate its hash. This calculation is much slower for a compiled pattern than for a string:

In [1]: import re

In [2]: pattern = "(?:a(?:b(?:b\\é|sorbed)|ccessing|gar|l(?:armists|ternation)|ngels|pparelled|u(?:daciousness's|gust|t(?:horitarianism's|obiographi
   ...: es)))|b(?:aden|e(?:nevolently|velled)|lackheads|ooze(?:'s|s))|c(?:a(?:esura|sts)|entenarians|h(?:eeriness's|lorination)|laudius|o(?:n(?:form
   ...: ist|vertor)|uriers)|reeks)|d(?:aze's|er(?:elicts|matologists)|i(?:nette|s(?:ciplinary|dain's))|u(?:chess's|shanbe))|e(?:lectrifying|x(?:ampl
   ...: ing|perts))|farmhands|g(?:r(?:eased|over)|uyed)|h(?:eft|oneycomb|u(?:g's|skies))|i(?:mperturbably|nterpreting)|j(?:a(?:guars|nitors)|odhpurs
   ...: 's)|kindnesses|m(?:itterrand's|onopoly's|umbled)|n(?:aivet\\é's|udity's)|p(?:a(?:n(?:els|icky|tomimed)|tios)|erpetuating|ointer|resentation|
   ...: yrite)|r(?:agtime|e(?:gret|stless))|s(?:aturated|c(?:apulae|urvy's|ylla's)|inne(?:rs|d)|m(?:irch's|udge's)|o(?:lecism's|utheast)|p(?:inals|o
   ...: onerism's)|tevedore|ung|weetest)|t(?:ailpipe's|easpoon|h(?:ermionic|ighbone)|i(?:biae|entsin)|osca's)|u(?:n(?:accented|earned)|pstaging)|v(?
   ...: :alerie's|onda)|w(?:hirl|ildfowl's|olfram)|zimmerman's)"

In [3]: compiled_pattern = re.compile(pattern)

In [4]: % timeit hash(pattern)
126 ns ± 0.358 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [5]: % timeit hash(compiled_pattern)
7.67 µs ± 21 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

hash(compiled_pattern) is 60 times slower than hash(pattern) here.

KeyError

When a pattern is unknown, _cache[type(pattern), pattern, flags] fails with a KeyError.

The KeyError gets handled and ignored. Only then does _compile check if the pattern is already compiled. If it is, it gets returned, without being written in cache.

It means that the next time _compile is called with the same compiled pattern, it will calculate the useless, slow hash again, but will still fail with a KeyError.

Error handling is expensive, and I suppose that's the main reason why re.search(compiled_pattern, text) is slower than re.search(pattern, text).

This weird behaviour might be a choice to speed up calls with string patterns, but it might have been a good idea to write a warning if _compile is called with an already compiled pattern.

Upvotes: 16

Related Questions