user3477108

Reputation: 897

Confused about re.compile behaviour

Please help me understand why the code in CODE2 is working. My understanding is that re.compile returns an object and we specify methods like search, match, findall etc to obtain the desired result.

What I am confused about is how re module-level functions like search are able to accept the compiled object as the pattern parameter. Please see CODE2.

CODE1:

In [430]: p = re.compile(r'(\b\w+)\s+\1')
In [431]: p.search('Paris in the the spring').group()
Out[431]:
'the the'

CODE2:

In [432]: re.search(p, 'Paris in the the spring').group()
Out[432]:
'the the'

Upvotes: 2

Views: 569

Answers (2)

Alex Martelli

Reputation: 881457

re.search's first argument is documented as being an RE pattern string, i.e., not a compiled RE object -- maybe a better design would have been to accept either polymorphically, but, alas!, back when the re module was being developed, that's just not how we did it. Ah well, at least it's just the same across all module functions in re that mimic the methods of re compiled objects!-)

However, at some point in Python's long and storied history, somebody fixed our original mis-design (I can't say when that happened!): nowadays, while for most re functions the first pattern argument is still said to have to be an RE pattern string, a few are now documented as "may be a string or an RE object"... and all of them appear to work that (better!) way.

So if you have a compiled re, in theory, according to (most of) the docs, you need to call its methods (and it's generally the best approach, except in the very shortest of snippets) rather than pass it to re module level functions such as re.search. But in practice, a compiled re object will be just as fine as an as-documented RE pattern.
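To make that concrete, here is a small sketch using the pattern from the question. It only compares the three call styles; the variable names m_method, m_module and m_string are mine, chosen for illustration.

```python
import re

pattern_text = r'(\b\w+)\s+\1'
p = re.compile(pattern_text)
text = 'Paris in the the spring'

# 1. Method on the compiled object (the approach recommended above).
m_method = p.search(text)

# 2. Module-level function given the compiled object.
m_module = re.search(p, text)

# 3. Module-level function given the pattern string, as documented.
m_string = re.search(pattern_text, text)

# All three find the same doubled word at the same position.
print(m_method.group(), m_method.span())
print(m_module.group(), m_module.span())
print(m_string.group(), m_string.span())
```

All three calls should report the same match, which is what the question's CODE1 and CODE2 show for the first two.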

Man, good thing I became aware of this just as I start preparing (with two co-authors) the third edition of "Python in a Nutshell"... at least I'll get to "fix the docs" for that!-)

Added: to measure speed, as usual, timeit is your friend!

$ python -mtimeit -s'import re; s="paris in the spring"; mre=re.compile("paris")' 're.match("paris", s)'
1000000 loops, best of 3: 1.38 usec per loop

versus:

$ python -mtimeit -s'import re; s="paris in the spring"; mre=re.compile("paris")' 'mre.match(s)'
1000000 loops, best of 3: 0.468 usec per loop

So, overall, you can get about 3 times faster by calling re.compile once and reusing the result, than by letting a re function handle everything for you. See why I'm a fan of the re.compile approach?-)
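If you would rather run the same comparison from inside a script, the timeit module offers a function with the same name. This is only a sketch: the loop count n is an arbitrary choice of mine, and the absolute numbers (and the ratio) will vary with your machine and Python version, so no expected timings are shown.

```python
import re
import timeit

s = "paris in the spring"
mre = re.compile("paris")

n = 100_000  # arbitrary loop count, picked only to keep the run short

# Module-level function: looks up the pattern on every call.
t_module = timeit.timeit(lambda: re.match("paris", s), number=n)

# Method on the precompiled object: no per-call lookup.
t_method = timeit.timeit(lambda: mre.match(s), number=n)

print("re.match(pattern, s): %.3f usec per loop" % (t_module / n * 1e6))
print("mre.match(s):         %.3f usec per loop" % (t_method / n * 1e6))
```

Note that the re module caches recently compiled patterns, so the extra cost of the module-level call is mostly the cache lookup and the extra function call, not a full recompilation each time.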

Moreover, in Python, "faster" tends to correlate strongly with "more Pythonic" ("idiomatic in Python", in other words). When you're unsure which of two approaches is more Pythonic, timeit them properly (preferably using the command-line approach python -mtimeit), and if either approach is reliably faster, you have your answer: that approach is more Pythonic!-)

Upvotes: 3

user2555451

All of the functions in the re module allow you to specify a pattern object instead of a pattern string. It is simply an optimization/convenience feature that allows you to avoid building a new pattern object if you already have one.
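As a quick illustration, here is a sketch that passes one compiled pattern to several module-level functions. The digit pattern and sample text are made up for this example and are not from the question.

```python
import re

# Hypothetical pattern and input, chosen only for illustration.
digits = re.compile(r'\d+')
text = 'a1b22c333d'

found = re.findall(digits, text)      # every run of digits
replaced = re.sub(digits, '#', text)  # each run replaced with '#'
pieces = re.split(digits, text)       # the text between the runs
first = re.search(digits, text)       # first run only

print(found)          # ['1', '22', '333']
print(replaced)       # a#b#c#d
print(pieces)         # ['a', 'b', 'c', 'd']
print(first.group())  # 1
```

Each call gives the same result as the matching method on the pattern object, e.g. digits.findall(text).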


I was unable to find a docs link which explicitly mentions this behavior, but you can see it quite easily if you view the source code [1]. To start, the implementation for re.compile is:

def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a pattern object."
    return _compile(pattern, flags)

Notice how this function does nothing but call another function named _compile. This function is what actually builds a pattern object. re.compile is just a thin wrapper around it.


Moving on, the implementation for re.search is:

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

As you can see, there is also nothing special about the re.search function. It just calls the search method of the pattern object returned by re._compile. This means that doing:

re.search(p, 'Paris in the the spring').group()

is the same as:

re._compile(p).search('Paris in the the spring').group()

So the question you should be asking is why re._compile allows you to pass a pattern object in place of a pattern string. As before, the answer can be found by looking at the implementation:

def _compile(pattern, flags):
    ...
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError(
                "Cannot process flags argument with a compiled pattern")
        return pattern

As you can see, the _compile function does a check to see if its pattern argument is already a pattern object. If so, it simply returns it and avoids building a new one. This means that doing:

re.search(p, 'Paris in the the spring').group()

is equivalent to:

re._compile(p).search('Paris in the the spring').group()

which becomes:

p.search('Paris in the the spring').group()
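You can check both branches of that isinstance test from the public API, without touching the private re._compile. A minimal sketch, assuming CPython; the names same_object and flags_rejected are mine:

```python
import re

p = re.compile(r'(\b\w+)\s+\1')
text = 'Paris in the the spring'

# Passing an already-compiled pattern hands back the very same object.
same_object = re.compile(p) is p

# Passing flags together with a compiled pattern is rejected,
# which is the ValueError branch in the snippet above.
try:
    re.search(p, text, re.IGNORECASE)
    flags_rejected = False
except ValueError:
    flags_rejected = True

print(same_object)     # True
print(flags_rejected)  # True
```

So if you need flags, pass them to re.compile when you build the pattern, not to the module-level function later.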

[1] This is the source code for CPython.

Upvotes: 1
