Emsi
Emsi

Reputation: 422

python list of lists takes 10 times more memory than expected

The python list of list is super inefficient. Even in simplest case it takes about 10 times more memory than simple list with the very same data. The following code illustrated the problem:

import memory_profiler
import sys

t=[]

m=memory_profiler.memory_usage()
t= ["one two three four five" for l in range(60000000)]
print ("getsize={} | memory_usage={}".format(sys.getsizeof(t),memory_profiler.memory_usage()[0]-m[0]))


m=memory_profiler.memory_usage()
t= [["one two three four five"] for l in range(60000000)]
print ("getsize={} | memory_usage={}".format(sys.getsizeof(t),memory_profiler.memory_usage()[0]-m[0]))

import gc
gc.collect()
print ("getsize={} | memory_usage={}".format(sys.getsizeof(t),memory_profiler.memory_usage()[0]-m[0]))

This code outputs something like:

getsize=536721616 | memory_usage=459.4375
getsize=536721616 | memory_usage=4181.0703125
getsize=536721616 | memory_usage=4181.0703125

Is there any way to cure this behavior?

Using generator is not an option, I need the data in memory for later processing by multiple workers. The data has to be in list of lists format.

Upvotes: 1

Views: 720

Answers (1)

Matthieu Moy
Matthieu Moy

Reputation: 16517

Lists in python are resizeable arrays. To give you an efficient implementation of list.append(), some memory is preallocated after the list's actual storage.

Tuple do not have this issue, they are immutable hence the memory allocation can fit the actually used size. Try changing ["one two three four five"] to ("one two three four five",) (don't forget the , as I did in my original message, otherwise it's not a tuple): the memory usage drops considerably. I get (for 600000 to avoid blowing off the RAM of my laptop):

getsize=4826320 | memory_usage=18.94140625
getsize=4826320 | memory_usage=4.2421875
getsize=4826320 | memory_usage=4.2421875

It's actually surprising that it drops below the plain list implementation, but I don't have an explanation for that!

Anyway, if you're starting to worry about performance, it most likely mean that Python, or at least Python's native data-structure are not the right tool. For example, if you want to manipulate matrices, lists of lists of numbers will be a real pain while NumPy's array will be both efficient and convenient.

Upvotes: 2

Related Questions