Reputation: 23354
In the following trivial examples there are two functions that sort a list of random numbers. The first method passes sorted
a generator expression, the second method creates a list first:
import random
l = [int(1000*random.random()) for i in xrange(10*6)]
def sort_with_generator():
return sorted(a for a in l)
def sort_with_list():
return sorted([a for a in l])
Benchmarking with line profiler indicates that the second option (sort_with_list
) is about twice as fast as the generator expression.
Can anyone explain what's happening, and why the first method is so much slower than the second?
Upvotes: 6
Views: 2120
Reputation: 3131
If you look at the source for sorted
, any sequence you pass in gets copied into a new list first.
newlist = PySequence_List(seq);
generator
--> list
appears to be slower than list
--> list
.
>>> timeit.timeit('x = list(l)', setup = 'l = xrange(1000)')
16.656711101531982
>>> timeit.timeit('x = list(l)', setup = 'l = range(1000)')
4.525658845901489
As to why a copy must be made, think about how sorting works. Sorts aren't linear algorithms. We move through the data multiple times, sometimes traversing data in both directions. A generator is intended for producing a sequence through which we iterate once and only once, from start to somewhere after it. A list allows for random access.
On the other hand, creating a list from a generator will mean only one list in memory, while making a copy of a list will mean two lists in memory. Good 'ol fashioned space-time tradeoff.
Python uses Timsort, a hybrid of merge sort and insertion sort.
Upvotes: 3
Reputation: 471
List expressions, firstly, loads data into a memory. Then doing any operations with resulting list. Let the allocation time is T2
(for second case).
Generator expressions not allocating time at once, but it change iterator value for time t1[i]
. Sum of all t1[i]
will be T1
. T1
≈ T2
.
But when you call sorted()
, in the first case time T1
added with time of allocation memory of every pairs compared to sort (tx1[i]
). In result, T1
added with sum of all tx1[i]
.
Therefore, T2
< T1 + sum(tx1[i])
Upvotes: 0
Reputation: 172437
Your first example is a generator expression that iterates over a list. Your second example is a list expression that iterates over a list. Indeed, the second example is slightly faster.
>>> import timeit
>>> timeit("sorted(a for a in l)", setup="import random;l = [int(1000*random.random()) for i in xrange(10*6)]")
5.963912010192871
>>> timeit("sorted([a for a in l])", setup="import random;l = [int(1000*random.random()) for i in xrange(10*6)]")
5.021576881408691
The reason for this is undoubtedly that making a list is done in one go, while iterating over a generator requires function calls.
Generators are not to speed up small lists like this (you have 60 elements in the list, that's very small). It's to save memory when creating long lists, primarily.
Upvotes: 6