havakok

Reputation: 1247

numpy memory error converting a big dataset from list to numpy array

I am preprocessing a large dataset for NN training. My dataset is accumulated in features = list().

When attempting features = np.array(features) I am getting:

numpy.core._exceptions.MemoryError: Unable to allocate 29.6 GiB for an array with shape (37990, 605, 173) and data type float64

I have seen a number of solutions in other posts, such as saving and reloading (which did not work, since np.save converts to an array first), using uint8 for images, or using a lower-memory dtype where possible.

The problem is that my input is a tensor, not an image. I am not sure what the maximal values are, and because of my classification task I don't know whether I can use another dtype. I am trying to avoid a Keras generator because of the implementation overhead. So, my question is: is there a way of handling this dataset without using a generator?
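For reference, a minimal sketch of the pattern described above (make_sample is a hypothetical stand-in for the actual preprocessing step); the allocation size in the error follows directly from the shape and dtype:

import numpy as np

# Illustrative reproduction of the failing pattern
features = []
for _ in range(37990):
    features.append(make_sample())   # each sample has shape (605, 173), float64

features = np.array(features)        # tries to allocate one contiguous block:
# 37990 * 605 * 173 * 8 bytes (float64) ≈ 29.6 GiB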

Upvotes: 0

Views: 1153

Answers (1)

Itamar Turner-Trauring

Reputation: 3900

You can use numpy's memmap support: this will back the data with a file on disk, while still acting like a normal numpy array. So it doesn't have to fit in memory.

https://numpy.org/doc/stable/reference/generated/numpy.memmap.html

See https://pythonspeed.com/articles/mmap-vs-zarr-hdf5/ for explanation of how this works.
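A minimal sketch of what that could look like, assuming `features` is the list of (605, 173) samples from the question; the filename "features.dat" is just illustrative:

import numpy as np

n_samples, rows, cols = 37990, 605, 173

# Disk-backed array: behaves like a normal ndarray, but the data lives in a
# file and is paged in as needed rather than held entirely in RAM.
out = np.memmap("features.dat", dtype=np.float64, mode="w+",
                shape=(n_samples, rows, cols))

# Copy one sample at a time, so only a single (605, 173) block is in memory
# at once, instead of materialising the whole 29.6 GiB with np.array(features).
for i, sample in enumerate(features):
    out[i] = sample

out.flush()  # make sure everything is written to disk

# Later (e.g. in the training script) it can be reopened read-only:
# out = np.memmap("features.dat", dtype=np.float64, mode="r",
#                 shape=(n_samples, rows, cols))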

Upvotes: 1
