user3150970

Reputation: 133

Python - "in" statement search slow for list of objects

I'm hoping someone can explain why searching a list of object references is so much slower than searching a normal list. The search uses the Python "in" keyword, which I thought runs at "C compiler" speed. I thought a list is just an array of object references (pointers), so the search should be extremely fast. Both lists are exactly 412236 bytes in memory.

Normal list (takes 0.000 seconds to search):

alist = ['a' for x in range(100000)]
if 'b' in alist:
    print("Found")

List of object references (takes 0.469 !! seconds to search):

class Spam:
    pass
spamlist = [Spam() for x in range(100000)]
if Spam() in spamlist:
    print("Found")


Edit: So apparently this has something to do with old-style classes having way more overhead than new-style classes. My script that was bogging down with only 400 objects can now easily handle up to 10000 objects simply by making all my classes inherit from the "object" class. Just when I thought I knew Python!

I've read about new-style vs old-style before but it was never mentioned that old-style classes can be up to 100x slower than new style ones. What is the best way to search a list of object instances for a particular instance?
1. Keep using the "in" statement but make sure all classes are new style.
2. Perform some other type of search using the "is" statement like:

[obj for obj in spamlist if obj is target]

3. Some other more Pythonic way?

Upvotes: 3

Views: 5827

Answers (4)

user2357112

Reputation: 281121

This is mostly due to the different special method lookup mechanics of old-style classes.

>>> timeit.timeit("Spam() in l", """
... # Old-style
... class Spam: pass
... l = [Spam() for i in xrange(100000)]""", number=10)
3.0454677856675403
>>> timeit.timeit("Spam() in l", """
... # New-style
... class Spam(object): pass
... l = [Spam() for i in xrange(100000)]""", number=10)
0.05137817007346257
>>> timeit.timeit("'a' in l", 'l = ["b" for i in xrange(100000)]', number=10)
0.03013876870841159

As you can see, the version where Spam inherits from object runs much faster, almost as fast as the case with strings.

The in operator for lists uses == to compare items for equality. == is defined to try the objects' __eq__ methods, their __cmp__ methods, and pointer comparison, in that order.
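
In pure Python, the membership scan behaves roughly like the sketch below (an approximation, not CPython's actual C code); the per-element cost described in the rest of this answer is hidden inside the == on each item:

def list_contains(lst, target):
    # Rough Python approximation of the list membership scan: every element
    # is compared with ==, which may fall all the way back to a pointer
    # (identity) comparison when no __eq__/__cmp__ is defined.
    for item in lst:
        if item == target:
            return True
    return False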

For old-style classes, this is implemented in a straightforward but slow manner. Python has to actually look for the __eq__ and __cmp__ methods in each instance's dict and the dicts of each instance's class and superclasses. __coerce__ gets looked up too, as part of the 3-way compare process. When none of these methods actually exist, that's something like 12 dict lookups just to get to the pointer comparison. There's a bunch of other overhead besides the dict lookups, and I'm not actually sure which aspects of the process are the most time-consuming, but suffice it to say that the procedure is more expensive than it could be.

For built-in types and new-style classes, things are better. First, Python doesn't look for special methods on the instance's dict. This saves some dict lookups and enables the next part. Second, type objects have C-level function pointers corresponding to the Python-level special methods. When a special method is implemented in C or doesn't exist, the corresponding function pointer allows Python to skip the method lookup procedure entirely. This means that in the new-style case, Python can quickly detect that it should skip straight to the pointer comparison.
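
Here is a small Python 2 demonstration of that lookup difference (a sketch of the behaviour described above, with made-up class names):

class OldSpam:            # old-style
    pass

class NewSpam(object):    # new-style
    pass

old = OldSpam()
new = NewSpam()

# Old-style: special methods are found by ordinary attribute lookup,
# so even an attribute set on the instance is honoured by ==.
old.__eq__ = lambda other: True
print(old == object())    # True

# New-style: == consults only type(new), so the instance attribute is
# ignored and the comparison falls back to pointer comparison.
new.__eq__ = lambda other: True
print(new == object())    # False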

As for what you should do, I'd recommend using in and new-style classes. If you find that this operation is becoming a bottleneck, but you need old-style classes for backward compatibility, any(x is y for y in l) runs about 20 times faster than x in l:

>>> timeit.timeit('x in l', '''
... class Foo: pass
... x = Foo(); l = [Foo()] * 100000''', number=10)
2.8618816054721936
>>> timeit.timeit('any(x is y for y in l)', '''
... class Foo: pass
... x = Foo(); l = [Foo()] * 100000''', number=10)
0.12331640524583776

Upvotes: 6

James Sapam

Reputation: 16940

This is not a direct answer to your question, but it is very good background for anyone who wants to understand how the 'in' keyword works under the hood:

Relevant sources: the ceval.c source code, the abstract.c source code, and the mailing-list thread about the 'in' keyword.

Explanation from the mail thread:

I'm curious enough about this (OK, I admit it, I like to be right, too ;) to dig into the details, if anyone is interested... one of the benefits of Python being open-source is that you can find out how it works...

First step, look at the bytecodes:

>>> import dis
>>> def f(x, y):
...   return x in y
...
>>> dis.dis(f)
2          0 LOAD_FAST                0 (x)
           3 LOAD_FAST                1 (y)
           6 COMPARE_OP               6 (in)
           9 RETURN_VALUE

So in is implemented as a COMPARE_OP. Looking in ceval.c for COMPARE_OP, it has some optimizations for a few fast compares, then calls cmp_outcome() which, for 'in', calls PySequence_Contains().

PySequence_Contains() is implemented in abstract.c. If the container implements __contains__, that is called, otherwise _PySequence_IterSearch() is used.
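
A quick way to see the first branch (a toy example of my own, not from the mail thread): a container that defines __contains__ answers 'in' directly, with no iteration at all.

class AlwaysContains:
    def __contains__(self, item):
        # 'in' dispatches straight here via PySequence_Contains()
        return True

print('anything' in AlwaysContains())   # True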

_PySequence_IterSearch() calls PyObject_GetIter() to construct an iterator on the sequence, then goes into an infinite loop (for (;;)) calling PyIter_Next() on the iterator until the item is found or the call to PyIter_Next() returns an error.

PyObject_GetIter() is also in abstract.c. If the object has an __iter__() method, that is called, otherwise PySeqIter_New() is called to construct an iterator.

PySeqIter_New() is implemented in iterobject.c. Its next() method is iter_iternext(). This method calls __getitem__() on its wrapped object and increments an index for next time.

So, though the details are complex, I think it is pretty fair to say that the implementation uses a while loop (in _PySequence_IterSearch()) and a counter (wrapped in PySeqIter_Type) to implement 'in' on a container that defines __getitem__ but not __iter__.

By the way the implementation of 'for' also calls PyObject_GetIter(), so it uses the same mechanism to generate an iterator for a sequence that defines __getitem__().
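
For example (a toy class of my own, not from the thread), an object that defines only __getitem__ works with both 'in' and 'for' through exactly this PySeqIter_New() fallback:

class Squares:
    # No __iter__ and no __contains__: 'in' and 'for' fall back to
    # calling __getitem__ with 0, 1, 2, ... until IndexError is raised.
    def __getitem__(self, index):
        if index >= 5:
            raise IndexError
        return index * index

print(4 in Squares())            # True
print([x for x in Squares()])    # [0, 1, 4, 9, 16]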

Upvotes: 2

dawg

Reputation: 103884

Using alist = ['a' for x in range(100000)] as the test case can be very misleading because of string interning. It turns out that Python interns (in most cases) short immutable values, especially strings, so they all end up being the same object.

Demo:

>>> alist=['a' for x in range(100000)]
>>> len(alist)
100000
>>> len({id(x) for x in alist})
1

You can see that although a list of 100000 strings is created, it consists of only a single interned object.

A fairer test is to use a call to object() to guarantee that each element is a unique Python object:

>>> olist=[object() for x in range(100000)]
>>> len(olist)
100000
>>> len({id(x) for x in olist})
100000

If you compare the in operator with olist you will find the timing to be similar.
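
For instance, one rough way to check this yourself (a sketch; timings will vary by machine, so no numbers are shown here):

import timeit

# Membership test against 100000 distinct object() instances...
print(timeit.timeit('object() in olist',
                    'olist = [object() for x in range(100000)]', number=10))
# ...versus the original interned-string version.
print(timeit.timeit("'b' in alist",
                    "alist = ['a' for x in range(100000)]", number=10))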

Upvotes: 0

IceArdor

Reputation: 2041

Python creates one immutable 'a' object, and every element of the list points to that same object. Since Spam instances are mutable and therefore not shared, each Spam() call creates a distinct object, so dereferencing the pointers in spamlist touches many different areas of RAM. The performance difference may have something to do with hardware cache hits/misses.

Obviously the performance difference would be even greater if you included the list creation time in your results (instead of timing just Spam() in spamlist). Also try x = Spam(); x in spamlist to see if that makes a difference.

I am curious how any(imap(equalsFunc, spamlist)) compares.
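
One possible way to spell that out (my own sketch; equalsFunc here is just an identity check against a target instance, and the names are placeholders):

import timeit

setup = '''
from itertools import imap      # Python 2; use the built-in map in Python 3
from functools import partial
import operator

class Spam:
    pass

spamlist = [Spam() for x in range(100000)]
target = Spam()
equalsFunc = partial(operator.is_, target)   # equalsFunc(y) -> target is y
'''

print(timeit.timeit('any(imap(equalsFunc, spamlist))', setup, number=10))
print(timeit.timeit('any(target is y for y in spamlist)', setup, number=10))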

Upvotes: 0
