Reputation: 11293
The following example seems to imply run time optimisation that I do not understand
Can anyone explain this behavior and how it may apply to more generic cases?
Consider the following simple (example) functions
def y(x): # str output
y = 1 if x else 0
return str(y)
def _y(x): # no str
y = 1 if x else 0
return y
Assume I want to apply the function y
upon all elements in a list
l = range(1000) # test input data
A map
operation will have to iterate through all elements in the list. It seems counter intuitive that breaking the function apart into a double map
significantly outperforms the single map
function
%timeit map(str, map(_y, l))
1000 loops, best of 3: 206 µs per loop
%timeit map(y, l)
1000 loops, best of 3: 241 µs per loop
More generically, this also applies to non standard library nested functions for example
def f(x):
return _y(_y(x))
%timeit map(_y, map(_y, l))
1000 loops, best of 3: 235 µs per loop
%timeit map(f, l)
1000 loops, best of 3: 294 µs per loop
Is this a python overhead issue where map
is compiling the low level python code where possible and consequently being throttled when it has to interpret a nested function?
Upvotes: 3
Views: 412
Reputation: 1124100
The difference is that map()
is implemented in C code, and calling to other C-implemented functions is cheap, while calling to Python code is expensive. On top of that, calling out to other callables from Python code is expensive too:
>>> timeit.timeit('f(1)', 'def f(x): return str(x)')
0.21682000160217285
>>> timeit.timeit('str(1)')
0.140916109085083
and thirdly, you are passing in the function objects to map()
(so no further lookups are done), but y()
has to look up the str
name each and every time. Global lookups are relatively expensive compared to local lookups; binding a global to a function argument to make it a local can help offset this a little:
>>> timeit.timeit('f(1)', 'def f(x, _str=str): return _str(x)')
0.19425392150878906
Much closer to the str(1)
version, even though that had to use a global too; it can still beat the function call hands-down if you gave the time test a local too:
>>> timeit.timeit('_str(1)', '_str = str')
0.10266494750976562
So, Python bytecode execution requires an additional object, a stack frame, to be created for each call. That stack frame object has to be managed on a dedicated Python call stack when calling out to other code. Moreover, your y
function looks up the str
name as a global each time, whereas the map(str, ...)
call keeps a single reference to that object and uses it over and over again.
By moving the str()
call out of the y
function and letting map()
call str()
directly instead through the single reference, you removed the stack handling and global name lookups, and sped things up marginally.
As a diagram, map(y, l)
executes, per input value:
y
, execute the body
str
as a global
y
stackframe onto stackstr(...)
whereas map(str, map(_y, l))
executes
_y
str(...)
The same applies to your f()
function setup:
>>> def f(x):
... return _y(_y(x))
...
>>> timeit.timeit("map(_y, map(_y, l))", 'from __main__ import _y, testdata as l', number=10000)
2.691640853881836
>>> timeit.timeit("map(f, l)", 'from __main__ import f, testdata as l', number=10000)
3.104063034057617
Calling map()
on _y
twice is faster than nesting the _y(_y(x))
call in another function, which then has to do global name look-ups and stress the Python stack some more; in your f()
example each map()
iteration has to create 3 stack frames and push and pop these on and off the stack, while in your map(_y, map(_y, ...))
setup, only 2 frames are created per iterated item:
f
, execute the body
_y
as a global
f
stackframe onto stack_y
, execute the body_y
as a global (yes, again)
f
stackframe onto stack_y
, execute the bodyversus:
_y
, execute the body
_y
, execute the body
Again, using locals can off-set the difference a little:
>>> def f(x, _y=_y):
... return _y(_y(x))
...
>>> timeit.timeit("map(f, l)", 'from __main__ import f, testdata as l', number=10000)
2.981696128845215
but that extra Python frame object is still hobbling the single map(f, ...)
call.
TLDR: Your y()
function suffers from O(N) additional global name lookups and O(N) additional stack frame objects being pushed onto and off the Python stack, compared to the double map()
version.
If speed matters this match, try to avoid creating Python stack frames and global name lookups in a tight loop.
Upvotes: 5