m9_psy

Reputation: 3386

Bytecode optimization

Here are two simple examples. In the first example the append lookup produces a LOAD_ATTR instruction inside the loop; in the second it is produced only once and the result is saved in a variable (i.e. cached). As a reminder: I know there is an extend method for this task, which is much faster than this.

setup = \
"""LIST = []
ANOTHER_LIST = [i for i in range(10**7)]

def appender(list, another_list):
    for elem in another_list:
        list.append(elem)

def appender_optimized(list, another_list):
    append_method = list.append
    for elem in another_list:
        append_method(elem)"""


import timeit

print(timeit.timeit("appender(LIST, ANOTHER_LIST)", setup=setup, number=10))
print(timeit.timeit("appender_optimized(LIST, ANOTHER_LIST)", setup=setup, number=10))

Results:

11.92684596051036
7.384205785584728
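
The repeated lookup is visible in the bytecode too. Here is a quick dis check on the same two functions (a sketch; newer CPython versions show LOAD_METHOD/CALL_METHOD instead of LOAD_ATTR/CALL_FUNCTION):

import dis

def appender(list, another_list):
    for elem in another_list:
        list.append(elem)

def appender_optimized(list, another_list):
    append_method = list.append
    for elem in another_list:
        append_method(elem)

# the loop body contains LOAD_FAST list + LOAD_ATTR append on every iteration
dis.dis(appender)

# the lookup happens once before the loop; the body only loads append_method
dis.dis(appender_optimized)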

That difference of about 4.5 seconds (even for such a big list) is no joke; in my opinion such a difference cannot be counted as a "micro optimization". Why doesn't Python do this for me? Because the bytecode must be an exact reflection of the source code? Does the compiler optimize anything at all? For example,

def te():
    a = 2
    a += 1
    a += 1
    a += 1
    a += 1

produces

LOAD_FAST                0 (a)
LOAD_CONST               2 (1)
INPLACE_ADD
STORE_FAST               0 (a)

four times, instead of optimizing it into a += 4. Or does it optimize some well-known things, like producing a bit shift instead of multiplying by 2? Am I misunderstanding something about basic language concepts?

Upvotes: 13

Views: 5864

Answers (2)

mgilson

Reputation: 309919

Python is a dynamic language. This means that you have a lot of freedom in how you write code. Due to the crazy amounts of introspection that python exposes (which are incredibly useful BTW), many optimizations simply cannot be performed. For example, in your first example, python has no way of knowing what datatype list is going to be when you call it. I could create a really weird class:

class CrazyList(object):
    def append(self, value):
        # the first call silently replaces append on the instance
        def new_append(value):
            print("Hello world")

        self.append = new_append

Obviously this isn't useful, but I can write this and it is valid python. If I were to pass this type to your above function, the behaviour would be different from the version where you "cache" the append function.

We could write a similar example for += (it could have side-effects that wouldn't get executed if the "compiler" optimized it away).
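
For instance, a user-defined type can overload += with a visible side effect, so folding four a += 1 statements into a single a += 4 would change observable behaviour. A minimal sketch (the Counter class is invented purely for illustration):

class Counter(object):
    def __init__(self, value):
        self.value = value
        self.increments = 0

    def __iadd__(self, other):
        self.increments += 1    # side effect the compiler cannot see
        self.value += other
        return self

a = Counter(2)
a += 1
a += 1
a += 1
a += 1
print(a.value, a.increments)    # 6 4 -- folded into "a += 4" it would print 6 1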

In order to optimize efficiently, python would have to know your types ... And for the vast majority of your code, it has no (fool-proof) way to get the type data, so it doesn't even try for most optimizations.


Please note that this is a micro optimization (and a well documented one). It is useful in some cases, but in most cases it is unnecessary if you write idiomatic python. e.g. your list example is best written using the .extend method as you've noted in your post. Most of the time, if you have a loop that is tight enough for the lookup time of a method to matter in your overall program runtime, then either you should find a way to rewrite just that loop to be more efficient or even push the computation into a faster language (e.g. C). Some libraries are really good at this (numpy).
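
For reference, the idiomatic rewrite is a one-liner (a sketch reusing the question's parameter names); timed with the same harness it should beat both looped versions, since the loop runs entirely in C:

def appender_extend(list, another_list):
    # one C-level call, no Python-level loop and no repeated lookups
    list.extend(another_list)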


With that said, there are some optimizations that can be done safely by the "compiler" in a stage known as the "peephole optimizer". e.g. It will do some simple constant folding for you:

>>> import dis
>>> def foo():
...     a = 5 * 6
... 
>>> dis.dis(foo)
  2           0 LOAD_CONST               3 (30)
              3 STORE_FAST               0 (a)
              6 LOAD_CONST               0 (None)
              9 RETURN_VALUE        

In some cases, it'll cache values for later use, or turn one type of object into another:

>>> def translate_tuple(a):
...   return a in [1, 3]
... 
>>> import dis
>>> dis.dis(translate_tuple)
  2           0 LOAD_FAST                0 (a)
              3 LOAD_CONST               3 ((1, 3))
              6 COMPARE_OP               6 (in)
              9 RETURN_VALUE

(Note the list got turned into a tuple and cached; in Python 3.2+ set literals can also get turned into a frozenset and cached.)
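
The set-literal case can be checked the same way (a sketch; constant indices and byte offsets depend on the CPython version):

>>> def in_set(a):
...   return a in {1, 3}
... 
>>> dis.dis(in_set)
  2           0 LOAD_FAST                0 (a)
              3 LOAD_CONST               3 (frozenset({1, 3}))
              6 COMPARE_OP               6 (in)
              9 RETURN_VALUE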

Upvotes: 10

Alex Hall

Reputation: 36033

In general Python optimises virtually nothing. It won't even optimise trivial things like x = x. Python is so dynamic that doing so correctly would be extremely hard. For example the list.append method can't be automatically cached in your first example because it could be changed in another thread, something which can't be done in a more static language like Java.
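
This is easy to check with dis (a sketch; byte offsets and const indices depend on the CPython version): the redundant load/store pair is left in the bytecode.

>>> import dis
>>> def noop(x):
...   x = x
... 
>>> dis.dis(noop)
  2           0 LOAD_FAST                0 (x)
              3 STORE_FAST               0 (x)
              6 LOAD_CONST               0 (None)
              9 RETURN_VALUE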

Upvotes: 2
