Jack Twain

Reputation: 6372

A list with 70 MB on disk but 500MB in memory

I have a Python list of string tuples of the form: lst = [('xxx', 'yyy'), ...etc]. The list has around 8154741 tuples. I used a profiler and it says the list takes around 500 MB in memory. Then I wrote all the tuples in the list to a text file, and the file is only around 72 MB on disk.
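A minimal sketch of how the two sizes can be compared (placeholder data; sys.getsizeof only approximates the deep size, and a real memory profiler counts shared strings more carefully):

import sys

# Placeholder data shaped like the real list: ~8.1 million 2-tuples of short strings.
lst = [('xxx', 'yyy') for _ in range(8154741)]

# Approximate in-memory size: the list's pointer array plus every tuple
# and the strings it references (shared strings get counted repeatedly here).
mem = sys.getsizeof(lst) + sum(
    sys.getsizeof(t) + sys.getsizeof(t[0]) + sys.getsizeof(t[1]) for t in lst)
print('%.0f MB in memory' % (mem / 1024.0 ** 2))

# On-disk size: one tab-separated line per tuple.
with open('pairs.txt', 'w') as f:
    for a, b in lst:
        f.write(a + '\t' + b + '\n')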

I have three questions:

Upvotes: 4

Views: 516

Answers (3)

smci

Reputation: 33940

Well, are the strings mostly shared or unique? And what is the significance of the tuples: a bag-of-words or skip-gram representation? If it is something like that, one good library for vector representations of words is word2vec,

and here's a good article on optimizing word2vec's performance

Do you actually need to keep the string contents in memory, or can you convert them to a vector of features and write the string<->feature correspondence to disk?
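If the strings repeat a lot, a minimal sketch of that idea is to keep only small integer ids in memory and write the id<->string table to disk (the names and file layout here are just illustrative):

import numpy as np

# Illustrative input: pairs of (possibly repeated) word strings.
pairs = [('xxx', 'yyy'), ('xxx', 'zzz')]

# Assign each distinct string a small integer id.
vocab = {}
def to_id(s):
    return vocab.setdefault(s, len(vocab))

# Keep only the ids in memory: 2 x 4 bytes per pair with int32.
ids = np.array([(to_id(a), to_id(b)) for a, b in pairs], dtype=np.int32)

# Persist the string <-> id correspondence instead of the strings themselves.
with open('vocab.txt', 'w') as f:
    for word, idx in sorted(vocab.items(), key=lambda kv: kv[1]):
        f.write('%d\t%s\n' % (idx, word))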

Upvotes: 0

jtaylor

Reputation: 2424

You have 8154741 tuples. Assuming 8-byte pointers, the list alone already contains 62 MB of pointers to tuples. Assuming each tuple holds two ASCII strings on Python 2, that is another 124 MB of pointers inside the tuples. Then there is still the overhead of the tuple and string objects themselves: each object carries a reference count, and assuming that is an 8-byte integer you have another 186 MB of reference-count storage. That is already 372 MB of overhead for the 46 MB of data you would have with two 3-byte strings per 2-element tuple. On Python 3 your data is Unicode and may take more than 1 byte per character, too.

So yes, it is expected that this type of structure consumes a large amount of excess memory.
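You can check the per-object overhead yourself with sys.getsizeof (exact numbers depend on the Python version and build; the sketch below assumes 64-bit CPython):

import sys

t = ('xxx', 'yyy')
print(sys.getsizeof(t))          # tuple object: header, refcount and two pointers
print(sys.getsizeof(t[0]))       # string object: header, refcount and the characters
print(sys.getsizeof([t] * 100))  # list object: header plus one 8-byte pointer per slot

# None of these sizes includes the others, so the totals add up
# roughly as in the calculation above.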

If your strings are all of similar length and the tuples all have the same number of elements, one way to reduce this is to use numpy string arrays. They store the strings in one contiguous memory block, avoiding the per-object overhead. This will not work well if the string lengths vary a lot, though, as numpy does not support ragged arrays.

>>> import numpy
>>> d = [("xxx", "yyy") for i in range(8154741)]
>>> a = numpy.array(d)   # dtype is a fixed-width string, e.g. '|S3'
>>> print a.nbytes/1024**2
46
>>> print a[2,1]
yyy

Upvotes: 3

leeladam

Reputation: 1758

Python objects can take much more memory than the raw data they contain: to provide the features of Python's high-level data structures, the interpreter stores extra bookkeeping with every object and creates intermediate and temporary objects along the way. Read more here.

There are several ways to work around this issue; see a case study here. In most cases it is enough to find the most suitable Python data type for your application (would it not be better to use a numpy array instead of a list in your case?). For further optimization you can move to Cython, where you can declare the types (and thus the sizes) of your variables directly, as in C.

There are also packages like IOPro that try to optimize memory usage (that one is commercial, though; does anyone know a free package for this?).

Upvotes: 2
