Reputation: 422
The python list of list is super inefficient. Even in simplest case it takes about 10 times more memory than simple list with the very same data. The following code illustrated the problem:
import memory_profiler
import sys
t=[]
m=memory_profiler.memory_usage()
t= ["one two three four five" for l in range(60000000)]
print ("getsize={} | memory_usage={}".format(sys.getsizeof(t),memory_profiler.memory_usage()[0]-m[0]))
m=memory_profiler.memory_usage()
t= [["one two three four five"] for l in range(60000000)]
print ("getsize={} | memory_usage={}".format(sys.getsizeof(t),memory_profiler.memory_usage()[0]-m[0]))
import gc
gc.collect()
print ("getsize={} | memory_usage={}".format(sys.getsizeof(t),memory_profiler.memory_usage()[0]-m[0]))
This code outputs something like:
getsize=536721616 | memory_usage=459.4375
getsize=536721616 | memory_usage=4181.0703125
getsize=536721616 | memory_usage=4181.0703125
Is there any way to cure this behavior?
Using generator is not an option, I need the data in memory for later processing by multiple workers. The data has to be in list of lists format.
Upvotes: 1
Views: 720
Reputation: 16517
Lists in python are resizeable arrays. To give you an efficient implementation of list.append()
, some memory is preallocated after the list's actual storage.
Tuple do not have this issue, they are immutable hence the memory allocation can fit the actually used size. Try changing ["one two three four five"]
to ("one two three four five",)
(don't forget the ,
as I did in my original message, otherwise it's not a tuple): the memory usage drops considerably. I get (for 600000 to avoid blowing off the RAM of my laptop):
getsize=4826320 | memory_usage=18.94140625 getsize=4826320 | memory_usage=4.2421875 getsize=4826320 | memory_usage=4.2421875
It's actually surprising that it drops below the plain list implementation, but I don't have an explanation for that!
Anyway, if you're starting to worry about performance, it most likely mean that Python, or at least Python's native data-structure are not the right tool. For example, if you want to manipulate matrices, lists of lists of numbers will be a real pain while NumPy's array will be both efficient and convenient.
Upvotes: 2