ovgolovin

Reputation: 13410

Understanding performance difference

While answering this question, I came across an interesting situation: two similar code snippets performed quite differently. I'm asking here just to understand the reason for that and to improve my intuition for such cases.

I'll adapt the code snippets for Python 2.7 (in Python 3 the performance difference is the same).

from collections import OrderedDict
from operator import itemgetter
from itertools import izip

items = OrderedDict([('a', 10), ('b', 9), ('c', 4), ('d', 7), ('e', 3), ('f', 0), ('g', -5), ('h', 9)])

def f1():
    return min(items, key=items.get)

def f2():
    return min(items.iteritems(), key=itemgetter(1))[0]


from timeit import Timer
N = 100000

print(Timer(stmt='f1()', setup='from __main__ import f1').timeit(number = N))
print(Timer(stmt='f2()', setup='from __main__ import f2').timeit(number = N))

Output:

0.603327797248
1.21580172899

The first solution has to do a lookup in the OrderedDict to get the value for each key. The second solution just iterates over the OrderedDict's key-value pairs, which have to be packed into tuples.

The second solution is 2 times slower.

Why is that?

I recently watched this video, where Raymond Hettinger says that Python tends to reuse tuples, so there are no extra allocations.

So, what does this performance issue boil down to?


I want to elaborate a bit on why I'm asking.

The first solution does a dictionary lookup. That means computing the key's hash, finding the bin for that hash, getting the key from that bin (hopefully with no key collisions), and then getting the value associated with that key.

The second solution just goes through all the bins and yields all the keys in them. It walks the bins one by one without the overhead of computing which bin to take. Yes, it has to access the values associated with those keys, but each value is only one step away from its key, whereas the first solution has to go through the hash-bin-key-value chain every time it needs a value. Both solutions have to fetch the value: the first gets it through the hash-bin-key-value chain, the second by following one more pointer once it has the key. The only overhead of the second solution is that it has to store the value temporarily in a tuple along with the key. It turns out that this storing is the main source of the overhead. I still don't fully understand why that is, given the so-called "tuple reuse" (see the video mentioned above).

To my mind, the second solution has to save the value along with the key, but it saves us from having to do the hash-bin-key computation and lookup to get the value for that key.
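
Here is a rough sanity check (counting_get is just a throwaway wrapper for illustration, not part of the original code) confirming that the first solution performs one keyed lookup per element:

calls = [0]

def counting_get(key):
    # each call repeats the hash-bin-key-value chain described above
    calls[0] += 1
    return items[key]

print(min(items, key=counting_get))  # 'g'
print(calls[0])                      # 8 -- one lookup per key in items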

Upvotes: 8

Views: 500

Answers (4)

nymk

Reputation: 3393

The performance difference is mainly caused by OrderedDict.
OrderedDict uses dict's get and __getitem__, but defines its own __iter__ and iteritems.


    def __iter__(self):
        'od.__iter__()  iter(od)'
        # Traverse the linked list in order.
        root = self.__root
        curr = root[1]                                  # start at the first node
        while curr is not root:
            yield curr[2]                               # yield the curr[KEY]
            curr = curr[1]                              # move to next node

    def iteritems(self):
        'od.iteritems -> an iterator over the (key, value) pairs in od'
        for k in self:
            yield (k, self[k])

Look at what we found: self[k].
Your second solution does not help us avoid the hash-bin-key calculation, whereas the iterator generated by a plain dict (more precisely, items.iteritems().next() if items is a dict) does not perform that calculation at all.
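
A quick way to see this (CountingOrderedDict is just a throwaway subclass for illustration, not anything from the stdlib): counting __getitem__ calls shows that OrderedDict.iteritems performs one keyed lookup per pair, whereas a plain dict's item iterator walks the table directly in C and never calls __getitem__.

from collections import OrderedDict

class CountingOrderedDict(OrderedDict):
    lookups = 0
    def __getitem__(self, key):
        CountingOrderedDict.lookups += 1
        return OrderedDict.__getitem__(self, key)

od = CountingOrderedDict([('a', 10), ('b', 9), ('c', 4)])
list(od.iteritems())
print(CountingOrderedDict.lookups)  # 3 -- one self[k] lookup per pair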

Moreover, iteritems is also more expensive.

from timeit import Timer
N = 1000

d = {i:i for i in range(10000)}

def f1():
    for k in d: pass

def f2():
    for k in d.iterkeys(): pass

def f3():
    for v in d.itervalues(): pass

def f4():
    for t in d.iteritems(): pass

print(Timer(stmt='f1()', setup='from __main__ import f1').timeit(number=N))
print(Timer(stmt='f2()', setup='from __main__ import f2').timeit(number=N))
print(Timer(stmt='f3()', setup='from __main__ import f3').timeit(number=N))
print(Timer(stmt='f4()', setup='from __main__ import f4').timeit(number=N))

Output

0.256800375467
0.265079360645
0.260599391822
0.492333103788

Compared to iterkeys' dictiter_iternextkey and itervalues' dictiter_iternextvalue, iteritems' dictiter_iternextitem has some additional parts.


    if (result->ob_refcnt == 1) {
        Py_INCREF(result);
        Py_DECREF(PyTuple_GET_ITEM(result, 0));
        Py_DECREF(PyTuple_GET_ITEM(result, 1));
    } else {
        result = PyTuple_New(2);
        if (result == NULL)
            return NULL;
    }
    di->len--;
    key = ep[i].me_key;
    value = ep[i].me_value;
    Py_INCREF(key);
    Py_INCREF(value);
    PyTuple_SET_ITEM(result, 0, key);
    PyTuple_SET_ITEM(result, 1, value);

I think that tuple creation could decrease the performance.

Python indeed tends to reuse tuples.
tupleobject.c shows

/* Speed optimization to avoid frequent malloc/free of small tuples */
#ifndef PyTuple_MAXSAVESIZE
#define PyTuple_MAXSAVESIZE     20  /* Largest tuple to save on free list */
#endif
#ifndef PyTuple_MAXFREELIST
#define PyTuple_MAXFREELIST  2000  /* Maximum number of tuples of each size to save */
#endif

This optimization just means Python does not build some tuples from scratch. But there is still a lot of work to be done.
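
As a small, CPython 2.7-specific experiment (using a plain dict, so we hit the C iterator above), we can watch the result->ob_refcnt == 1 branch in action: the cached result tuple is only recycled when we have already dropped our reference to the previously yielded one.

d = {'a': 1, 'b': 2, 'c': 3}

it = d.iteritems()
first = it.next()
cached = id(first)
del first                       # drop our reference; the cached tuple is free again
second = it.next()
print(id(second) == cached)     # True: the same tuple object is reused

it = d.iteritems()
first = it.next()
second = it.next()              # first is still alive, so a fresh tuple must be allocated
print(second is first)          # False: reuse was not possible here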


Case: dict

If OrderedDict is replaced by dict, I think the second solution is slightly better in general.
Python dictionaries are implemented as hash tables, so lookup is fast: the average time complexity of a lookup is O(1), while the worst case is O(n)¹. The average time complexity of your first solution is the same as that of your second solution: they are both O(n). Therefore, the second solution has no advantage, and is sometimes even slower, especially when the input data is small. In that case, the extra cost caused by iteritems cannot be compensated for.

from collections import OrderedDict
from operator import itemgetter
from timeit import Timer
from random import randint, random

N = 100000
xs = [('a', 10), ('b', 9), ('c', 4), ('d', 7), ('e', 3), ('f', 0), ('g', -5), ('h', 9)]

od = OrderedDict(xs)
d = dict(xs)

def f1od_min():
    return min(od, key=od.get)

def f2od_min():
    return min(od.iteritems(), key=itemgetter(1))[0]

def f1d_min():
    return min(d, key=d.get)

def f2d_min():
    return min(d.iteritems(), key=itemgetter(1))[0]

def f1od():
    for k in od: pass

def f2od():
    for t in od.iteritems(): pass

def f1d():
    for k in d: pass

def f2d():
    for t in d.iteritems(): pass

print 'min'
print(Timer(stmt='f1od_min()', setup='from __main__ import f1od_min').timeit(number=N))
print(Timer(stmt='f2od_min()', setup='from __main__ import f2od_min').timeit(number=N))
print(Timer(stmt='f1d_min()', setup='from __main__ import f1d_min').timeit(number=N))
print(Timer(stmt='f2d_min()', setup='from __main__ import f2d_min').timeit(number=N))
print
print 'traverse'
print(Timer(stmt='f1od()', setup='from __main__ import f1od').timeit(number=N))
print(Timer(stmt='f2od()', setup='from __main__ import f2od').timeit(number=N))
print(Timer(stmt='f1d()', setup='from __main__ import f1d').timeit(number=N))
print(Timer(stmt='f2d()', setup='from __main__ import f2d').timeit(number=N))

Output

min
0.398274431527
0.813040903243
0.185168156847
0.249574387248    <-- dict/the second solution

traverse
0.251634216081
0.642283865687
0.0565099754298
0.0958057518483

Then replace N and xs by

N = 50
xs = [(x, randint(1, 100)) for x in range(100000)]

Output

min
1.5148923257
3.47020082161
0.712828585756
0.70823812803    <-- dict/the second solution

traverse
0.975989336634
2.92283956481
0.127676073356
0.253622387762

Now replace N and xs by

N = 10
xs = [(random(), random()) for x in range(1000000)]

Output

min
6.23311265817
10.702984667
4.32852708934
2.87853889251    <-- dict/the second solution

traverse
2.06231783648
9.49360449443
1.33297618831
1.73723008092

Finally, the second solution starts to shine.


The worst case for the first solution: hash collisions

Let

N = 10000
xs = [(2 ** (32 + x) - 2 ** x + 1, 1) for x in range(100)]
# hash(2 ** (32 + x) - 2 ** x + 1) is always 1

Output

min
2.44175265292    <-- lookup is slow
2.76424538594    <-- lookup is slow
2.26508627493    <-- lookup is slow
0.199363955475

traverse
0.200654482623
2.59635966303    <-- lookup is slow
0.0454684184722
0.0733798569371

¹ The Average Case times listed for dict objects assume that the hash function for the objects is sufficiently robust to make collisions uncommon. The Average Case assumes the keys used in parameters are selected uniformly at random from the set of all keys. See TimeComplexity.

Upvotes: 6

Inbar Rose

Reputation: 43437

To expand on my previous answer: to get a better view of what is happening, you can always use the dis module.

>>> import dis
>>> dis.dis(f1)
              0 LOAD_GLOBAL              0 (min)
              3 LOAD_GLOBAL              1 (items)
              6 LOAD_CONST               1 ('key')
              9 LOAD_GLOBAL              1 (items)
             12 LOAD_ATTR                2 (get)
             15 CALL_FUNCTION          257
             18 RETURN_VALUE    
>>> dis.dis(f2)
              0 LOAD_GLOBAL              0 (min)
              3 LOAD_GLOBAL              1 (items)
              6 LOAD_ATTR                2 (iteritems)
              9 CALL_FUNCTION            0
             12 LOAD_CONST               1 ('key')
             15 LOAD_GLOBAL              3 (itemgetter)
             18 LOAD_CONST               2 (1)
             21 CALL_FUNCTION            1
             24 CALL_FUNCTION          257
             27 LOAD_CONST               3 (0)
             30 BINARY_SUBSCR       
             31 RETURN_VALUE   

As you can see, there is a lot more happening in f2 (and therefore it stands to reason that it is slower).

You can always use the dis module to check anything in Python; it usually gives a good indication of what will perform better.


One can always use the timeit module to check the timing or performance of certain methods or functions on specific inputs. Sometimes, though, the timing can be misleading because the sample data suits one function better than another: for instance, when benchmarking a sorting function, a list that is already mostly sorted favors one kind of algorithm over another that would be faster on a less sorted list, and neither measurement accounts for the different kinds of data that might be inside the lists themselves. Using the dis module avoids most of that by showing directly what Python is doing behind the curtain (or under the hood), which gives a much clearer indication of which method might be best suited to a given task.

Upvotes: 1

zhangyangyu

Reputation: 8610

As for tuple reuse, I don't believe it happens here:

>>> a = (1,2)
>>> b = (1,2)
>>> id(a)
139912909456232
>>> id(b)
139912909456304
>>> 

You can see the difference by comparing with an int or a string:

>>> a = 1
>>> b = 1
>>> id(a)
34961336
>>> id(b)
34961336
>>> 
>>> a = 'a'
>>> b = 'a'
>>> id(a)
139912910202240
>>> id(b)
139912910202240
>>> 
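
For what it's worth, the reuse Hettinger talks about is a free list for discarded small tuples, not caching of equal tuples, so it only shows up once a tuple has actually been thrown away. A rough, CPython-only check (the tuples are built at run time so the compiler cannot constant-fold them):

x, y = 1, 2
a = (x, y)              # built at run time
addr = id(a)
del a                   # the discarded 2-tuple goes on the free list
b = (y, x)              # also built at run time
print(id(b) == addr)    # typically True on CPython: the freed slot is recycled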

edit:

For dict, your two methods are similar. Let's try:

>>> from timeit import Timer
>>> from operator import itemgetter
>>> a = {'a':1, 'b':2, 'c':3}
>>> N = 100000
# really quick to use []
>>> Timer(stmt='for x in a: z = a[x]', setup='from __main__ import a').timeit(number=N)
0.0524289608001709
# use get method
>>> Timer(stmt='for x in a: z = a.get(x)', setup='from __main__ import a').timeit(number=N)
0.10028195381164551
# use iterator and []
>>> Timer(stmt='for x in a.iteritems(): z = x[1]', setup='from __main__ import a').timeit(number=N)
0.08019709587097168
# use itemgetter and iterator
>>> b = itemgetter(1)
>>> Timer(stmt='for x in a.iteritems(): z = b(x)', setup='from __main__ import a, b').timeit(number=N)
0.09941697120666504

Though the times may vary from run to run, they are broadly accurate: using iteritems with itemgetter is about as quick as get.

But for OrderedDict, let's try again:

>>> a
OrderedDict([('a', 1), ('c', 3), ('b', 2)])
>>> N = 100000
#Use []
>>> Timer(stmt='for x in a: z = a[x]', setup='from __main__ import a').timeit(number=N)
0.2354598045349121
#Use get
>>> Timer(stmt='for x in a: z = a.get(x)', setup='from __main__ import a').timeit(number=N)
0.21950387954711914
#Use iterator
>>> Timer(stmt='for x in a.iteritems(): z = x[1]', setup='from __main__ import a').timeit(number=N)
0.29949188232421875
#Use iterator and itemgetter
>>> b = itemgetter(1)
>>> Timer(stmt='for x in a.iteritems(): z = b(x)', setup='from __main__ import a, b').timeit(number=N)
0.32039499282836914

You can see that, for OrderedDict, the version using get and the version using the iterator with itemgetter differ in time.

So I think the time difference comes from the implementation of OrderedDict, but I'm sorry to say I don't know exactly why.

Upvotes: 3

Inbar Rose

Reputation: 43437

As you yourself mentioned, there is a difference between the functions.

The first function iterates over the keys (strings); for each string it goes to the dictionary and looks up its value, then it finds the minimum and returns it.

The second function iterates over string/int pair tuples; for each one it accesses the second item (the int value), then it finds the minimum (which is a tuple in this case) and returns that result's first item.

The second function is doing a lot more work, on objects that require more processing (tuples rather than strings, then tuples rather than ints), plus the additional item retrieval at the end.

Why are you surprised?

Upvotes: 2
