redcrow

Reputation: 1823

Generating very large 2D-array in Python?

I'd like to generate a very large 2D array (or, in other terms, a matrix) using a list of lists. Each element should be a float.

So, just to give an example, let's assume we have the following code:

import numpy as np

N = 32000

def largeMat():
    m = []
    for i in range(N):
        l = list(np.ones(N))  # one row of N floats, converted to a Python list
        m.append(l)
        if i % 1000 == 0:
            print(i)  # progress indicator
    return m

m = largeMat()

I have 12GB of RAM, but as the code reaches the 10000th row of the matrix, my RAM is already full. Now, if I'm not wrong, each float is 64 bits (or 8 bytes) large, so the total occupied RAM should be:

32000 * 32000 * 8 bytes = 8 192 000 000 bytes ≈ 8192 MB
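As a quick sanity check of that figure (a minimal sketch, counting 1 MB as 10^6 bytes as above):

N = 32000
print(N * N * 8 / 10**6)  # 8192.0 MB of raw float data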

Why does Python fill my whole RAM and even start allocating into swap?

Upvotes: 2

Views: 4140

Answers (1)

DrV

Reputation: 23550

Python does not necessarily store list items in the most compact form, as lists require pointers to the next item, etc. This is a side effect of having a data type which allows deletes, inserts, etc. For a simple doubly linked list the usage would be two pointers plus the value; on a 64-bit machine that would be 24 octets per float item in the list. In practice the implementation is not that stupid, but there is still some overhead.
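A quick way to see this kind of per-item overhead (a small sketch on a 64-bit CPython build; the exact numbers can vary between versions):

>>> import sys
>>> sys.getsizeof(1.0)                        # one Python float object
24
>>> sys.getsizeof([1.0]) - sys.getsizeof([])  # one list slot = one pointer
8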

If you want a compact format, I'd suggest using a numpy.array, as it will take exactly as many bytes as you think it should (plus a small overhead).
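For example, a minimal sketch of the matrix from the question built as one contiguous array (N and the all-ones fill are taken from the question; float64 is numpy's default dtype anyway):

import numpy as np

N = 32000

m = np.ones((N, N), dtype=np.float64)  # one contiguous block instead of 32000 Python lists
print(m.nbytes / 10**6)                # about 8192 MB of raw data, as calculated above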

Edit: Oops. Not necessarily. The explanation was wrong, but the suggestion is valid: numpy is the right tool, as numpy.array exists for exactly this reason. However, the problem is most probably something else. My computer runs the procedure, even though it takes a lot of time (approx. 2 minutes). Also, quitting Python afterwards takes a long time (actually, it hung). Memory use of the Python process (as reported by top) peaks at 10 000 MB and then falls to slightly below 9 000 MB. Probably the allocated numpy arrays are not garbage collected very fast.
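If the memory really is held by objects that are no longer reachable, a collection can be requested explicitly; whether that helps in this particular case is an assumption, not something verified here:

import gc

del m          # drop the only reference to the big matrix
gc.collect()   # ask the collector to release whatever it can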

But about the raw data size in my machine:

>>> import sys
>>> l = [0.0] * 1000000
>>> sys.getsizeof(l)
8000072

So there seems to be a fixed overhead of 72 octets per list.

>>> listoflists = [ [1.0*i] * 1000000 for i in range(1000)]
>>> sys.getsizeof(listoflists)
9032
>>> sum([sys.getsizeof(l) for l in listoflists])
8000072000

So, this is as expected.

On the other hand, reserving and filling the long list of lists takes a while (about 10 s), and quitting Python takes a while as well. The same for numpy:

>>> import numpy
>>> a = numpy.empty((1000,1000000))
>>> a[:] = 1.0
>>> a.nbytes
8000000000

(The byte count is not entirely reliable, as the array object itself takes some space for its metadata: the pointer to the start of the memory block, the data type, the array shape, and so on.)

This takes much less time. Creating the array is almost instantaneous, and filling it with numbers takes maybe a second or two. Allocating and freeing a lot of small memory chunks is time-consuming, and while it does not cause fragmentation problems on a 64-bit machine, it is still much easier to allocate one big chunk of data.
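A rough timing sketch of the two approaches (the sizes here are scaled down from the question so the comparison fits comfortably in memory):

import time
import numpy as np

N = 1000  # scaled down from the 32000 in the question

t0 = time.time()
m = [list(np.ones(N)) for _ in range(N)]  # list-of-lists version
print("list of lists: %.2f s" % (time.time() - t0))

t0 = time.time()
a = np.empty((N, N))                      # single contiguous array
a[:] = 1.0
print("numpy array:   %.2f s" % (time.time() - t0))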

If you have a lot of data which can be put into an array, you need a good reason for not using numpy.

Upvotes: 3
