bsam
bsam

Reputation: 930

Why is python's built in sum function slow when used to flatten a list of lists?

When trying to flatten a list of lists using python 2.7's built-in sum function, I've ran across some performance issues - not only was the computation slow, but the iterative approach yielded much faster results.

The short code below seems to illustrate this performance gap:

import timeit

def sum1(arrs):
    return sum(arrs, [])

def sum2(arrs):
    s = []
    for arr in arrs:
        s += arr
    return s

def main():
    array_of_arrays = [[0] for _ in range(1000)]
    print timeit.timeit(lambda: sum1(array_of_arrays), number=100)
    print timeit.timeit(lambda: sum2(array_of_arrays), number=100)

if __name__=='__main__':
    main()

On my laptop, I get as output:

>> 0.247241020203
>> 0.0043830871582

Could anyone explain to me why is it so?

Upvotes: 4

Views: 786

Answers (1)

user2357112
user2357112

Reputation: 282026

Your sum2 uses +=:

    for arr in arrs:
        s += arr

sum does not use +=. sum is defined to use +. The difference is that s += arr is allowed to perform the operation by mutating the existing s list, while s = s + arr must construct a new list, copying the buffers of the old lists.

With +=, Python can use an efficient list resizing strategy that requires an amount of copying proportional to the size of the final list. For N lists of length K each, this takes time proportional to N*K.

With +, Python cannot do that. For every s = s + arr, Python must copy the entire s and arr lists to construct the new s. For N lists of size K each, the total time spent copying is proportional to N**2 * K, much worse.

Because of this, you should pretty much never use sum to concatenate sequences.

Upvotes: 6

Related Questions