( Update: Might only happen in CPython 3.8 32-bit for Windows, so don't be surprised if you can't reproduce it in other versions. See tables in the Update section.) Both iter and reversed result in specialized iterators for lists: >>> iter([1, 2, 3]) <list_iterator object at 0x031495C8> >>> reversed([1, 2, 3]) <list_reverseiterator object at 0x03168310> But the reversed one is much slower: > python -m timeit -s "a = list(range(1000))" "list(iter(a))" 50000 loops, best of 5: 5.76 usec per loop > python -m timeit -s "a = list(range(1000))" "list(reversed(a))" 20000 loops, best of 5: 14.2 usec per loop And I can consistently reproduce it. I later tried iter five more times, took 5.98, 5.84, 5.85, 5.87, 5.86. Then reversed five more times, took 14.3, 14.4, 14.4, 14.5, 14.3. I thought maybe iter benefits from increasing memory locations of the list's elements, so I tried reversing the list beforehand. Same picture: > python -m timeit -s "a = list(range(1000)); a.reverse()" "list(iter(a))" 50000 loops, best of 5: 5.73 usec per loop > python -m timeit -s "a = list(range(1000)); a.reverse()" "list(reversed(a))" 20000 loops, best of 5: 14.1 usec per loop Same picture with sum as well: > python -m timeit -s "a = list(range(1000))" "sum(iter(a))" 20000 loops, best of 5: 10.7 usec per loop > python -m timeit -s "a = list(range(1000))" "sum(reversed(a))" 10000 loops, best of 5: 20.9 usec per loop And with identical elements, too: > python -m timeit -s "a = [None] * 1000" "list(iter(a))" 50000 loops, best of 5: 6.35 usec per loop > python -m timeit -s "a = [None] * 1000" "list(reversed(a))" 20000 loops, best of 5: 14.5 usec per loop Why is the reverse iterator so much slower? I'm using CPython 3.8.1 32 bit on Windows 10 pro 64 bit version 1903 with an Intel i5-7200U (it's a HUAWEI MateBook X). No special configuration, just a normal Python install on a normal Windows install. Update: I ran a larger automated test with eight different Python versions (all freshly installed with default settings) on another machine (Pentium N3700, Windows 10 Pro 64-bit 1903). Times in usec per loop: 32-bit 64-bit CPython iter reversed iter reversed 3.5.4 16.6 17.0 15.2 16.2 3.6.8 16.8 17.2 14.9 15.8 3.7.6 16.5 16.9 14.8 15.5 3.8.1 16.3 22.1 14.6 15.5 Two things to note: Python 3.8.1 32-bit reversed is the only one much slower. Might explain why almost nobody else could reproduce it. In all seven other versions, reversed was a bit slower than iter . About 0.4 usec in 32-bit and about 0.9 usec in 64-bit. I ran those 16 tests in Round-robin fashion for ten rounds, and each time shown above is the best of its ten source times. Each of the 160 source times was done like this: python.exe -m timeit -r 5 -s "a = list(range(1000))" "list(iter(a))" or python.exe -m timeit -r 5 -s "a = list(range(1000))" "list(reversed(a))" The times for each of the 16 tests were pretty consistent. Full table (note that the round-robin means I ran these column by column, not row by row): 3.5.4 32-bit iter [16.7, 16.6, 17.3, 16.6, 16.7, 16.6, 16.6, 16.6, 16.6, 16.7] 3.5.4 32-bit reversed [17.1, 17.1, 17.1, 17.2, 17.1, 17.1, 17.0, 17.1, 17.1, 17.1] 3.5.4 64-bit iter [15.2, 15.4, 15.4, 15.4, 15.4, 15.4, 15.4, 15.3, 15.4, 15.3] 3.5.4 64-bit reversed [16.8, 16.2, 16.3, 16.3, 16.2, 16.2, 16.2, 16.2, 16.2, 16.3] 3.6.8 32-bit iter [17.3, 16.9, 16.8, 16.9, 16.9, 16.8, 16.9, 16.9, 16.8, 16.8] 3.6.8 32-bit reversed [17.2, 17.2, 17.2, 17.3, 17.3, 17.3, 17.3, 17.2, 17.2, 17.2] 3.6.8 64-bit iter [15.0, 14.9, 15.9, 14.9, 14.9, 15.0, 14.9, 14.9, 14.9, 14.9] 3.6.8 64-bit reversed [15.8, 15.9, 16.4, 15.9, 15.9, 16.0, 15.8, 15.9, 15.9, 15.8] 3.7.6 32-bit iter [16.6, 17.2, 16.6, 16.5, 16.7, 16.7, 16.5, 16.5, 16.5, 16.7] 3.7.6 32-bit reversed [17.2, 17.6, 17.0, 17.0, 16.9, 17.2, 17.3, 17.0, 17.5, 17.0] 3.7.6 64-bit iter [14.8, 15.1, 14.9, 14.9, 14.8, 15.1, 14.9, 14.8, 15.0, 14.9] 3.7.6 64-bit reversed [16.0, 20.1, 15.7, 15.6, 15.6, 15.6, 15.7, 15.7, 15.8, 15.5] 3.8.1 32-bit iter [16.4, 16.6, 16.3, 16.4, 16.5, 16.4, 16.5, 16.4, 16.8, 16.4] 3.8.1 32-bit reversed [22.3, 22.4, 22.2, 22.3, 22.3, 22.3, 22.5, 22.4, 22.3, 22.1] 3.8.1 64-bit iter [14.6, 15.1, 14.6, 14.7, 14.7, 14.7, 14.7, 14.6, 14.6, 14.6] 3.8.1 64-bit reversed [15.5, 16.1, 15.5, 15.6, 15.5, 15.5, 15.5, 15.5, 15.5, 15.5] The same test on a list with a million values ( list(range(250)) * 4000 ). Times are msec per loop: 32-bit 64-bit CPython iter reversed iter reversed 3.5.4 19.8 19.9 22.4 22.7 3.6.8 19.8 19.9 22.3 22.6 3.7.6 19.9 19.9 22.3 22.5 3.8.1 19.8 24.9 22.4 22.6 The variation is even smaller, except reversed on 3.8.1 32-bit is much slower again. One more, just with CPython 3.8.0 instead of 3.8.1, where it also happens. 32-bit 64-bit CPython iter reversed iter reversed 3.5.4 19.5 19.6 21.9 22.2 3.6.8 19.5 19.7 21.8 22.1 3.7.6 19.5 19.6 21.7 22.0 3.8.0 19.4 24.5 21.7 22.1

Reputation: 27588

Why is reversed(mylist) so slow?

(Update: Might only happen in CPython 3.8 32-bit for Windows, so don't be surprised if you can't reproduce it in other versions. See tables in the Update section.)

Both iter and reversed result in specialized iterators for lists:

>>> iter([1, 2, 3])
<list_iterator object at 0x031495C8>

>>> reversed([1, 2, 3])
<list_reverseiterator object at 0x03168310>

But the reversed one is much slower:

> python -m timeit -s "a = list(range(1000))" "list(iter(a))"
50000 loops, best of 5: 5.76 usec per loop

> python -m timeit -s "a = list(range(1000))" "list(reversed(a))"
20000 loops, best of 5: 14.2 usec per loop

And I can consistently reproduce it. I later tried iter five more times, took 5.98, 5.84, 5.85, 5.87, 5.86. Then reversed five more times, took 14.3, 14.4, 14.4, 14.5, 14.3.

I thought maybe iter benefits from increasing memory locations of the list's elements, so I tried reversing the list beforehand. Same picture:

> python -m timeit -s "a = list(range(1000)); a.reverse()" "list(iter(a))"
50000 loops, best of 5: 5.73 usec per loop

> python -m timeit -s "a = list(range(1000)); a.reverse()" "list(reversed(a))"
20000 loops, best of 5: 14.1 usec per loop

Same picture with sum as well:

> python -m timeit -s "a = list(range(1000))" "sum(iter(a))"
20000 loops, best of 5: 10.7 usec per loop

> python -m timeit -s "a = list(range(1000))" "sum(reversed(a))"
10000 loops, best of 5: 20.9 usec per loop

And with identical elements, too:

> python -m timeit -s "a = [None] * 1000" "list(iter(a))"
50000 loops, best of 5: 6.35 usec per loop

> python -m timeit -s "a = [None] * 1000" "list(reversed(a))"
20000 loops, best of 5: 14.5 usec per loop

Why is the reverse iterator so much slower?

I'm using CPython 3.8.1 32 bit on Windows 10 pro 64 bit version 1903 with an Intel i5-7200U (it's a HUAWEI MateBook X). No special configuration, just a normal Python install on a normal Windows install.

Update: I ran a larger automated test with eight different Python versions (all freshly installed with default settings) on another machine (Pentium N3700, Windows 10 Pro 64-bit 1903). Times in usec per loop:

                32-bit              64-bit
CPython     iter   reversed     iter   reversed
 3.5.4      16.6     17.0       15.2     16.2
 3.6.8      16.8     17.2       14.9     15.8
 3.7.6      16.5     16.9       14.8     15.5
 3.8.1      16.3     22.1       14.6     15.5

Two things to note:

Python 3.8.1 32-bit reversed is the only one much slower. Might explain why almost nobody else could reproduce it.
In all seven other versions, reversed was a bit slower than iter. About 0.4 usec in 32-bit and about 0.9 usec in 64-bit.

I ran those 16 tests in Round-robin fashion for ten rounds, and each time shown above is the best of its ten source times. Each of the 160 source times was done like this:

python.exe -m timeit -r 5 -s "a = list(range(1000))" "list(iter(a))"
or
python.exe -m timeit -r 5 -s "a = list(range(1000))" "list(reversed(a))"

The times for each of the 16 tests were pretty consistent. Full table (note that the round-robin means I ran these column by column, not row by row):

3.5.4 32-bit iter     [16.7, 16.6, 17.3, 16.6, 16.7, 16.6, 16.6, 16.6, 16.6, 16.7]
3.5.4 32-bit reversed [17.1, 17.1, 17.1, 17.2, 17.1, 17.1, 17.0, 17.1, 17.1, 17.1]
3.5.4 64-bit iter     [15.2, 15.4, 15.4, 15.4, 15.4, 15.4, 15.4, 15.3, 15.4, 15.3]
3.5.4 64-bit reversed [16.8, 16.2, 16.3, 16.3, 16.2, 16.2, 16.2, 16.2, 16.2, 16.3]
3.6.8 32-bit iter     [17.3, 16.9, 16.8, 16.9, 16.9, 16.8, 16.9, 16.9, 16.8, 16.8]
3.6.8 32-bit reversed [17.2, 17.2, 17.2, 17.3, 17.3, 17.3, 17.3, 17.2, 17.2, 17.2]
3.6.8 64-bit iter     [15.0, 14.9, 15.9, 14.9, 14.9, 15.0, 14.9, 14.9, 14.9, 14.9]
3.6.8 64-bit reversed [15.8, 15.9, 16.4, 15.9, 15.9, 16.0, 15.8, 15.9, 15.9, 15.8]
3.7.6 32-bit iter     [16.6, 17.2, 16.6, 16.5, 16.7, 16.7, 16.5, 16.5, 16.5, 16.7]
3.7.6 32-bit reversed [17.2, 17.6, 17.0, 17.0, 16.9, 17.2, 17.3, 17.0, 17.5, 17.0]
3.7.6 64-bit iter     [14.8, 15.1, 14.9, 14.9, 14.8, 15.1, 14.9, 14.8, 15.0, 14.9]
3.7.6 64-bit reversed [16.0, 20.1, 15.7, 15.6, 15.6, 15.6, 15.7, 15.7, 15.8, 15.5]
3.8.1 32-bit iter     [16.4, 16.6, 16.3, 16.4, 16.5, 16.4, 16.5, 16.4, 16.8, 16.4]
3.8.1 32-bit reversed [22.3, 22.4, 22.2, 22.3, 22.3, 22.3, 22.5, 22.4, 22.3, 22.1]
3.8.1 64-bit iter     [14.6, 15.1, 14.6, 14.7, 14.7, 14.7, 14.7, 14.6, 14.6, 14.6]
3.8.1 64-bit reversed [15.5, 16.1, 15.5, 15.6, 15.5, 15.5, 15.5, 15.5, 15.5, 15.5]

The same test on a list with a million values (list(range(250)) * 4000). Times are msec per loop:

                32-bit              64-bit
CPython     iter   reversed     iter   reversed
 3.5.4      19.8     19.9       22.4     22.7
 3.6.8      19.8     19.9       22.3     22.6
 3.7.6      19.9     19.9       22.3     22.5
 3.8.1      19.8     24.9       22.4     22.6

The variation is even smaller, except reversed on 3.8.1 32-bit is much slower again.

One more, just with CPython 3.8.0 instead of 3.8.1, where it also happens.

                32-bit              64-bit
CPython     iter   reversed     iter   reversed
 3.5.4      19.5     19.6       21.9     22.2
 3.6.8      19.5     19.7       21.8     22.1
 3.7.6      19.5     19.6       21.7     22.0
 3.8.0      19.4     24.5       21.7     22.1

Upvotes: 19

Answers (2)

Raymond Hettinger

Reputation: 226256

Unable to reproduce a significant difference

I've tried this on Python 3.9 and Python 3.11 using the stock macOS builds from python.org. The timings show that forward and reverse iteration run at close to the same speed:

% python3.11 -m timeit -s "a = list(range(1000))" "list(iter(a))"
100000 loops, best of 5: 2.84 usec per loop
% python3.11 -m timeit -s "a = list(range(1000))" "list(reversed(a))"
100000 loops, best of 5: 2.85 usec per loop

% python3.9 -m timeit -s "a = list(range(1000))" "list(iter(a))"
100000 loops, best of 5: 2.87 usec per loop
% python3.9 -m timeit -s "a = list(range(1000))" "list(reversed(a))"
100000 loops, best of 5: 2.91 usec per loop

Source code

Examining the C source shows that the code for list_iterator and list_reverseiterator are almost identical.

The expectation is that the reversed iterator will be very slightly slower for two reasons:

The loop termination logic for reversed has two conditions index>=0 && index < PyList_GET_SIZE(seq) instead of the one condition for forward iteration it->it_index < PyList_GET_SIZE(seq).
The memory access controllers on some CPUs will issue automatic prefetches when traversing an array in a forward direction but not in reverse. See §12.1 Optimizing Caching in Agner Fog's "Optimizing subroutines in assembly" language.

Sources of Variability

Given the source code is almost identical, the cause of variability is likely due to the compilation and build process.

Perhaps the Windows compiler can better optimize forward iteration with its simpler loop termination condition.
A more likely cause is that the Windows build is using Profile Guided Optimization (PGO) to inform the code generation step. This is very sensitive to how the profile is built. If the profiled code made little or no use of the reverse iterator, then it will not benefit from PGO.

The usual recommendation here is to rerun PGO using code that you care about rather than just the test suite. That will tune the build to your specific needs.

Better benchmark

Consecutive integers created by range() tend to be small objects and laid out in consecutive memory locations.

Also, Tte timing for a small number of iterations tends to also include the overhead of the calls to list, iter, and reversed.

For a better benchmark, I would use a larger, well shuffled dataset consisting of more interesting objects. Also, I would skip including list() in the loop so the timing can focus just on the iteration rather than on the list building.

from timeit import repeat

setup = '''
from random import randrange
from random import randrange, shuffle
data = [str(randrange(100_000)) for i in range(10_000)]
shuffle(data)
'''

forward = 'for x in iter(data): pass'

backward = 'for x in reversed(data): pass'

Here are two runs:

>>> min(repeat(forward, setup, repeat=7, number=100_000))
4.0327267500106245

>>> min(repeat(backward, setup, repeat=7, number=100_000))
4.094246666994877

Upvotes: 3

David Garcia Ricou

Reputation: 40

I've been watching on python documentation and found this:

If the __reversed__() method is not provided, the reversed() built-in will fall back to using the sequence protocol (__len__() and __getitem__()). Objects that support the sequence protocol should only provide __reversed__() if they can provide an implementation that is more efficient than the one provided by reversed().

Probably is because of that.

Im not an expert but i saw the question is 7 months without answer and tryied to give you one based on python documentation.

Those are the resouces i used:

https://docs.python.org/3/library/functions.html#reversed https://docs.python.org/3/reference/datamodel.html#object

Upvotes: -1