Dreams

Reputation: 6122

Python: NumPy memory error on creating a 3D array. What's the better way to fill a 3D array?

I am making a 3D array of zeros and then filling it, but due to the size of the NumPy array it runs into memory issues even with 64 GB of RAM. Am I doing it wrong?

X_train_one_hot has shape (47827, 30, 20000) and encInput has shape (47827, 30, 200).

import numpy as np

X_train_one_hot_shifted = np.zeros((X_train_one_hot.shape[0], 30, 20200))
# X_train_one_hot.shape[0] = 47827
for j in range(X_train_one_hot.shape[0]):
    # Row 0 gets a start token; rows 1-29 are sample j shifted down by one.
    current = np.zeros((30, 20000))
    current[0][0] = 1
    current[1:] = X_train_one_hot[j][0:29]
    # print(current.shape, encInput[j].shape)
    combined = np.concatenate((current, encInput[j]), axis=1)
    X_train_one_hot_shifted[j] = combined

Any ideas to reduce memory consumption? Another interesting thing: X_train_one_hot has almost the same shape, yet creating it does not throw any error.

EDIT: The program gets killed inside the for loop with the error message:

TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.

Also, most of the array is sparse, since X_train_one_hot is a one-hot encoding of size 20000.

Upvotes: 2

Views: 1114

Answers (2)

lightalchemist

Reputation: 10221

Imtinan Azhar is correct. You simply do not have enough RAM to hold the array.

You have a few options.

1) You seem to have a very sparse matrix even though its size is large, so you can try one of the sparse matrix representations from SciPy (a sketch follows below).

If you are feeding the array into a library package such as Scikit-Learn or one of the deep learning libraries, this will likely not work.
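For example, a minimal sketch of this idea, assuming you keep one 2D sparse matrix per sample (SciPy's sparse types are 2D only, so a 3D array becomes a list of per-sample matrices; the toy sizes below stand in for the question's shapes):

import numpy as np
from scipy import sparse

# Toy sizes standing in for the question's (47827, 30, 20200).
n_samples, n_steps, n_features = 100, 30, 20200

# SciPy sparse matrices are 2D, so keep one CSR matrix per sample.
shifted_sparse = []
for j in range(n_samples):
    current = np.zeros((n_steps, n_features))
    current[0, 0] = 1  # ... fill the rest as in the question ...
    shifted_sparse.append(sparse.csr_matrix(current))

# CSR stores only the nonzero entries, so memory scales with the
# number of ones per sample instead of 30 * 20200 floats.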

2) Most DL libraries don't require you to load all your data at once. You can prepare your data in batches: create this matrix batch by batch and save each batch to file (preferably in a sparse representation), then use a data generator to feed your algorithm, or manually load batches of your data.
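A minimal generator sketch along these lines, reusing the question's construction (batch_size is an illustrative assumption, not from the original answer):

import numpy as np

def batch_generator(X_one_hot, enc_input, batch_size=32):
    # Build only one batch of the shifted array at a time, so peak memory
    # is batch_size * 30 * 20200 floats instead of the full array.
    n = X_one_hot.shape[0]
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        block = np.zeros((end - start, 30, 20200))
        for j in range(start, end):
            current = np.zeros((30, 20000))
            current[0, 0] = 1
            current[1:] = X_one_hot[j][:29]
            block[j - start] = np.concatenate((current, enc_input[j]), axis=1)
        yield block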

3) If none of these is possible, you can try to memory-map the array using NumPy's np.memmap; the NumPy documentation has further examples.
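A minimal memmap sketch ('X_shifted.dat' is a placeholder path):

import numpy as np

shape = (47827, 30, 20200)

# The array lives in a file on disk, not in RAM; mode='w+' creates it
# and the OS pages slices in and out on demand.
X_shifted = np.memmap('X_shifted.dat', dtype=np.float64, mode='w+', shape=shape)

for j in range(shape[0]):
    # ... assign the per-sample `combined` block as in the question ...
    pass

X_shifted.flush()  # push buffered writes out to disk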

4) Another option is to use dask and pull out slices of the data only when necessary.
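A minimal dask sketch; the chunk size of 1000 samples is an illustrative assumption:

import dask.array as da

# Chunk along the sample axis; dask materializes chunks lazily.
x = da.zeros((47827, 30, 20200), chunks=(1000, 30, 20200))

# Only the requested slice is computed and brought into memory.
block = x[:100].compute()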

Personally, I would go with option 2, or option 1 if the algorithms consuming the matrix can handle (or be modified to handle) sparse matrices.

Upvotes: 3

Imtinan Azhar

Reputation: 1753

Let's see: your X_train_one_hot_shifted has shape (47827, 30, 20200), which is 47827 * 30 * 20200 = 28,983,162,000 floats.

28,983,162,000 * 8 gives the memory consumption of this array in bytes (NumPy's default float64 is 8 bytes per element), which is 231,865,296,000 bytes.

Let's simplify this:

231,865,296,000 B
226,430,953.125 KB
221,123.977661 MB
215.941384435 GB
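A quick way to reproduce that estimate in Python (an illustrative check, not part of the original answer):

# 47827 * 30 * 20200 elements, 8 bytes each (float64)
n_bytes = 47827 * 30 * 20200 * 8
print(n_bytes / 1024**3)  # -> 215.94...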

You would need about 216 GB of RAM to fit X_train_one_hot_shifted in memory. I suspect the shape 20200 is a typo; double-check it.

Upvotes: 1
