cjameson

Reputation: 53

Extend vs list comprehension for flat list of sublists

Given old_list = list(range(1000)), there are two ways I'm familiar with for creating a new, flat list from multiple sublists:

  1. Using extend so that the new list is automatically flat. MWE:
new_list = []
for i in range(10,20):
    new_list.extend(old_list[:i])
  2. Using a list comprehension and then going back and flattening. MWE:
new_list = [old_list[:i] for i in range(10,20)]
new_list = [item for sublist in new_list for item in sublist]

I'm confused about which of these is more efficient for longer lists, and whether one would use less memory than the other. The latter seems more Pythonic, but I don't like having to go back and flatten, and I haven't found much discussion of the overhead of extend (whereas there's a lot about append).
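
For what it's worth, this is roughly how I would compare the two with timeit (just a sketch of the measurement, no numbers included):

import timeit

setup = "old_list = list(range(1000))"

extend_stmt = """
new_list = []
for i in range(10, 20):
    new_list.extend(old_list[:i])
"""

listcomp_stmt = """
new_list = [old_list[:i] for i in range(10, 20)]
new_list = [item for sublist in new_list for item in sublist]
"""

print(timeit.timeit(extend_stmt, setup=setup, number=10_000))
print(timeit.timeit(listcomp_stmt, setup=setup, number=10_000))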

Upvotes: 0

Views: 203

Answers (1)

cjameson

Reputation: 53

Practical answer

@juanpa.arrivillaga was exactly right in their diagnosis of the problem: Python does not bloat memory for this kind of operation, because the new list is simply filled with references to the same objects. I checked this using the memory-profiler package available from PyPI, with the list elements being the (more relevant to me) ndarray objects:

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    26     27.8 MiB     27.8 MiB           1   @profile
    27                                         def double_listcomp_test():
    28   1904.3 MiB   1876.6 MiB       10003       old_list = [a for a in
    29   1904.3 MiB      0.0 MiB           1                   np.random.randint(0,255,(10000,256,256,3),dtype=np.uint8)]
    30   1904.4 MiB      0.1 MiB        5008       old_list = [a for _ in range(5) for a in random.sample(old_list,k=1000)]
    31
    32   1904.4 MiB      0.0 MiB           1       print(len(old_list))
    33   1904.4 MiB      0.0 MiB           1       return old_list
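
To see the same reference-sharing behaviour without the profiler, here is a quick sanity check (a sketch, separate from the run above), using throwaway objects so the identity test is meaningful:

# The flattened list holds references to the same objects, not copies of them.
objs = [object() for _ in range(100)]
flat = [item for i in range(10, 20) for item in objs[:i]]

assert flat[0] is objs[0]                     # identical object, not a copy
assert sum(x is objs[0] for x in flat) == 10  # referenced once per slice, stored once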

The single-listcomp approach suggested in the comment works perfectly for me, and I think it is at least as readable as the other two approaches for a Python programmer used to working with generator expressions. We are essentially treating the collection of sublists as an iterator rather than as its own list. My "accepted" approach for the example above is then:

my_list = list(range(1000))
my_list = [item for i in range(10,20) for item in my_list[:i]]
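
For completeness, a quick check (something you can paste into a REPL) that this gives exactly the same flat list as the extend loop from the question:

old_list = list(range(1000))

# extend loop from the question
new_list_extend = []
for i in range(10, 20):
    new_list_extend.extend(old_list[:i])

# single list comprehension
new_list_comp = [item for i in range(10, 20) for item in old_list[:i]]

assert new_list_extend == new_list_comp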

For memory management

What I actually found I needed to do was copy the arrays into the new list so that I could delete the larger old list. This isn't really within the scope of the question, since it obviously causes a larger spike at creation time, but I wanted to include it because this question is what led me here. In this case I think the commented suggestion is especially appropriate, because it avoids having to un-Pythonically del old_list (see below).
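
For reference, the profiled runs below assume roughly this setup (a sketch reconstructed from the tables; the imports and surrounding script aren't shown there):

import random
import numpy as np
from memory_profiler import profile

@profile
def listcomp_test():
    # 10000 uint8 arrays of shape (256, 256, 3): ~1.87 GiB of data
    old_list = [a for a in
                np.random.randint(0, 255, (10000, 256, 256, 3), dtype=np.uint8)]
    new_list = [random.sample(old_list, k=1000) for _ in range(5)]
    new_list = [item.copy() for sublist in new_list for item in sublist]
    print(len(new_list))
    return new_list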

Two listcomps to flatten (as in question):

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    16     27.7 MiB     27.7 MiB           1   @profile
    17                                         def listcomp_test():
    18   1904.3 MiB   1876.6 MiB       10003       old_list = [a for a in
    19   1904.3 MiB      0.0 MiB           1                   np.random.randint(0,255,(10000,256,256,3),dtype=np.uint8)]
    20   1904.4 MiB      0.1 MiB           8       new_list = [random.sample(old_list,k=1000) for _ in range(5)]
    21   2842.6 MiB    938.3 MiB        5008       new_list = [item.copy() for sublist in new_list for item in sublist]
    22
    23   2842.7 MiB      0.0 MiB           1       print(len(new_list))
    24   2842.7 MiB      0.0 MiB           1       return new_list

Two listcomps with del old_list:

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    16     27.7 MiB     27.7 MiB           1   @profile
    17                                         def listcomp_test():
    18   1904.3 MiB   1876.6 MiB       10003       old_list = [a for a in
    19   1904.3 MiB      0.0 MiB           1                   np.random.randint(0,255,(10000,256,256,3),dtype=np.uint8)]
    20   1904.4 MiB      0.1 MiB           8       new_list = [random.sample(old_list,k=1000) for _ in range(5)]
    21   2842.7 MiB    938.3 MiB        5008       new_list = [item.copy() for sublist in new_list for item in sublist]
    22    967.1 MiB  -1875.5 MiB           1       del old_list
    23
    24    967.2 MiB      0.0 MiB           1       print(len(new_list))
    25    967.2 MiB      0.0 MiB           1       return new_list

Single listcomp (as in comment):

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    26     27.8 MiB     27.8 MiB           1   @profile
    27                                         def double_listcomp_test():
    28   1904.3 MiB   1876.6 MiB       10003       my_list = [a for a in
    29   1904.3 MiB      0.0 MiB           1                   np.random.randint(0,255,(10000,256,256,3),dtype=np.uint8)]
    30   2842.6 MiB   -937.2 MiB        5008       my_list = [a.copy() for _ in range(5) for a in random.sample(my_list,k=1000)]
    31
    32    967.1 MiB  -1875.5 MiB           1       print(len(my_list))
    33    967.1 MiB      0.0 MiB           1       return my_list

NOTE: Python is certainly not the best language to use if you're this concerned about memory management, but in data science applications we often don't have a choice, so I think it's worth having this written up as an answer. In every case Python has to hold the extra memory while building the new list, but with the suggested single-listcomp approach we avoid needlessly binding intermediate results to new variables, so the peak is held for as little time as possible.

Upvotes: 1
