Reputation: 254
Here is what I mean:
> python -m timeit "set().difference(xrange(0,10))"
1000000 loops, best of 3: 0.624 usec per loop
> python -m timeit "set().difference(xrange(0,10**4))"
10000 loops, best of 3: 170 usec per loop
Apparently python iterates through the whole argument, even if the result is known to be the empty set beforehand. Is there any good reason for this? The code was run in python 2.7.6.
(Even for nonempty sets, if you find that you've removed all of the first set's elements midway through the iteration, it makes sense to stop right away.)
Upvotes: 14
Views: 2218
Reputation: 19144
When Python core developers add new features, the first priority is correct code with thorough test coverage. That is hard enough in itself. Speedups often come later as someone has the idea and inclination. I opened a tracker issue 28071 summarizing the proposal and counter-reasons discussed here. I will try to summarize its disposition here.
UPDATE: An early-out for sets that start empty has been added for 3.6.0b1, due in about a day.
Upvotes: 3
Reputation: 226326
Is there any good reason for this?
Having a special path for the empty set had not come up before.
Even for nonempty sets, if you find that you've removed all of the first set's elements midway through the iteration, it makes sense to stop right away.
This is a reasonable optimization request. I've made a patch and will apply it shortly. Here are the new timings with the patch applied:
$ py -m timeit -s "r = range(10 ** 4); s = set()" "s.difference(r)"
10000000 loops, best of 3: 0.104 usec per loop
$ py -m timeit -s "r = set(range(10 ** 4)); s = set()" "s.difference(r)"
10000000 loops, best of 3: 0.105 usec per loop
$ py -m timeit -s "r = range(10 ** 4); s = set()" "s.difference_update(r)"
10000000 loops, best of 3: 0.0659 usec per loop
$ py -m timeit -s "r = set(range(10 ** 4)); s = set()" "s.difference_update(r)"
10000000 loops, best of 3: 0.0684 usec per loop
Upvotes: 5
Reputation: 11781
IMO it's a matter of specialisation, consider:
In [18]: r = range(10 ** 4)
In [19]: s = set(range(10 ** 4))
In [20]: %time set().difference(r)
CPU times: user 387 µs, sys: 0 ns, total: 387 µs
Wall time: 394 µs
Out[20]: set()
In [21]: %time set().difference(s)
CPU times: user 10 µs, sys: 8 µs, total: 18 µs
Wall time: 16.2 µs
Out[21]: set()
Apparently difference has specialised implementation for set - set
.
Note that difference operator requires right hand argument to be a set, while difference allows any iterable.
Per @wim implementation is at https://github.com/python/cpython/blob/master/Objects/setobject.c#L1553-L1555
Upvotes: 4