Reputation: 930
When trying to flatten a list of lists using python 2.7's built-in sum
function, I've ran across some performance issues - not only was the computation slow, but the iterative approach yielded much faster results.
The short code below seems to illustrate this performance gap:
import timeit
def sum1(arrs):
return sum(arrs, [])
def sum2(arrs):
s = []
for arr in arrs:
s += arr
return s
def main():
array_of_arrays = [[0] for _ in range(1000)]
print timeit.timeit(lambda: sum1(array_of_arrays), number=100)
print timeit.timeit(lambda: sum2(array_of_arrays), number=100)
if __name__=='__main__':
main()
On my laptop, I get as output:
>> 0.247241020203
>> 0.0043830871582
Could anyone explain to me why is it so?
Upvotes: 4
Views: 786
Reputation: 282026
Your sum2
uses +=
:
for arr in arrs:
s += arr
sum
does not use +=
. sum
is defined to use +
. The difference is that s += arr
is allowed to perform the operation by mutating the existing s
list, while s = s + arr
must construct a new list, copying the buffers of the old lists.
With +=
, Python can use an efficient list resizing strategy that requires an amount of copying proportional to the size of the final list. For N
lists of length K
each, this takes time proportional to N*K
.
With +
, Python cannot do that. For every s = s + arr
, Python must copy the entire s
and arr
lists to construct the new s
. For N
lists of size K
each, the total time spent copying is proportional to N**2 * K
, much worse.
Because of this, you should pretty much never use sum
to concatenate sequences.
Upvotes: 6