Reputation: 897
Please help me understand why the code in CODE2 works. My understanding is that re.compile returns an object, and we call methods on it such as search, match, findall, etc. to obtain the desired result.
What confuses me is how module-level re functions such as search are able to accept the compiled object as a parameter. Please see CODE2.
CODE1:
In [430]: p = re.compile(r'(\b\w+)\s+\1')
In [431]: p.search('Paris in the the spring').group()
Out[431]: 'the the'
CODE2:
In [432]: re.search(p, 'Paris in the the spring').group()
Out[432]: 'the the'
Upvotes: 2
Views: 569
Reputation: 881457
re.search's first argument is documented as being an RE pattern string, i.e., not a compiled RE object. Maybe a better design would have been to accept either polymorphically, but, alas!, back when the re module was being developed, that's just not how we did it. Ah well, at least it's the same across all module functions in re that mimic the methods of compiled re objects!-)
However, at some point in Python's long and storied history, somebody fixed our original mis-design (I can't say when that happened!): nowadays, while for most re functions the first pattern argument is still said to have to be an RE pattern string, a few are now documented as "may be a string or an RE object"... and all of them appear to work that (better!) way.
So if you have a compiled re, in theory, according to (most of) the docs, you need to call its methods rather than pass it to re module-level functions such as re.search (and calling its methods is generally the best approach anyway, except in the very shortest of snippets). But in practice, a compiled re object will work just as well as an as-documented RE pattern string.
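As a quick check of the claim above, here is a minimal sketch that runs the question's pattern both ways and compares the results. The variable names (text, via_method, via_function) are mine, chosen for illustration only.

```python
import re

# Pattern from the question: a word followed by whitespace and the same word.
p = re.compile(r'(\b\w+)\s+\1')
text = 'Paris in the the spring'

# Documented approach: call the method on the compiled object.
via_method = p.search(text)

# Also works in practice: pass the compiled object to the module-level function.
via_function = re.search(p, text)

print(via_method.group())    # the the
print(via_function.group())  # the the
print(via_method.span() == via_function.span())  # True
```

Both calls find the same match at the same position, so the choice between them is about style and speed, not results.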
Man, good thing I became aware of this just as I'm starting to prepare (with two co-authors) the third edition of "Python in a Nutshell"... at least I'll get to "fix the docs" for that!-)
Added: to measure speed, as usual, timeit is your friend!
$ python -mtimeit -s'import re; s="paris in the spring"; mre=re.compile("paris")' 're.match("paris", s)'
1000000 loops, best of 3: 1.38 usec per loop
versus:
$ python -mtimeit -s'import re; s="paris in the spring"; mre=re.compile("paris")' 'mre.match(s)'
1000000 loops, best of 3: 0.468 usec per loop
So, overall, you can get about 3 times faster by compiling things once with re.compile than by letting a re function handle everything for you. See why I'm a fan of the re.compile approach?-)
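If you would rather run the comparison from inside a script than from the shell, something along these lines should work. This is only a sketch: the helper time_call and the loop count are my own choices, and the absolute numbers will differ by machine and Python version, so no expected timings are shown.

```python
import re
import timeit

s = "paris in the spring"
mre = re.compile("paris")

# Names the timed statements are allowed to see.
namespace = {"re": re, "s": s, "mre": mre}

def time_call(stmt, number=100_000):
    """Return the best-of-3 time for stmt, in microseconds per call."""
    best = min(timeit.repeat(stmt, globals=namespace, number=number, repeat=3))
    return best / number * 1e6

module_usec = time_call('re.match("paris", s)')
method_usec = time_call('mre.match(s)')

print(f"re.match(...):  {module_usec:.3f} usec per call")
print(f"mre.match(...): {method_usec:.3f} usec per call")
```

The module-level call has to look up the compiled pattern in re's internal cache on every call, which is where the extra time goes.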
Moreover, in Python, "faster" tends to correlate strongly with "more Pythonic" ("idiomatic in Python", in other words). When you're unsure which of two approaches is more Pythonic, timeit them properly (preferably using the command-line approach, python -mtimeit), and if either approach is reliably faster, you have your answer: that approach is more Pythonic!-)
Upvotes: 3
Reputation:
All of the functions in the re module that take a pattern allow you to pass a pattern object instead of a pattern string. It is simply an optimization/convenience feature that lets you avoid building a new pattern object if you already have one.
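For instance, here is a small sketch (the pattern and sample text are made up for illustration) showing a few other module-level functions accepting a pattern object and agreeing with the corresponding methods:

```python
import re

p = re.compile(r'\d+')
text = 'a1b22c333'

# Module-level functions, given the compiled pattern object.
found = re.findall(p, text)
replaced = re.sub(p, '#', text)
pieces = re.split(p, text)

print(found)     # ['1', '22', '333']
print(replaced)  # a#b#c#
print(pieces)    # ['a', 'b', 'c', '']

# The pattern object's own methods give the same answers.
print(found == p.findall(text))        # True
print(replaced == p.sub('#', text))    # True
print(pieces == p.split(text))         # True
```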
I was unable to find a docs link that explicitly mentions this behavior, but you can see it quite easily if you view the source code [1]. To start, the implementation of re.compile is:
def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a pattern object."
    return _compile(pattern, flags)
Notice how this function does nothing but call another function named _compile. That function is what actually builds a pattern object; re.compile is just a thin wrapper around it.
Moving on, the implementation of re.search is:
def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)
As you can see, there is nothing special about the re.search function either. It just calls the search method of the pattern object returned by re._compile. This means that doing:
re.search(p, 'Paris in the the spring').group()
is the same as:
re._compile(p, 0).search('Paris in the the spring').group()
So the question you should be asking is why re._compile lets you pass a pattern object in place of a pattern string. As before, the answer can be found by looking at the implementation:
def _compile(pattern, flags):
    ...
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError(
                "Cannot process flags argument with a compiled pattern")
        return pattern
As you can see, the _compile function checks whether its pattern argument is already a pattern object. If so, it simply returns it and avoids building a new one (or raises ValueError if flags were also given, since a compiled pattern already carries its own flags). This means that doing:
re.search(p, 'Paris in the the spring').group()
is equivalent to:
re._compile(p, 0).search('Paris in the the spring').group()
which becomes:
p.search('Paris in the the spring').group()
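You can observe both branches of that check from the public API, without touching the private re._compile. The sketch below assumes CPython; getting the very same object back is an implementation detail rather than a documented guarantee, and the names same_object and flags_error are mine.

```python
import re

p = re.compile(r'(\b\w+)\s+\1')

# Branch 1: no flags, so the already-compiled pattern is handed straight back.
same_object = re.compile(p) is p
print(same_object)

# Branch 2: flags plus a compiled pattern is rejected with ValueError.
try:
    re.search(p, 'Paris in the the spring', re.IGNORECASE)
    flags_error = None
except ValueError as exc:
    flags_error = str(exc)
print(flags_error)
```

If you need different flags, recompile from the original string, for example re.compile(p.pattern, re.IGNORECASE).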
[1] This is the source code for CPython.
Upvotes: 1