Dreams

Reputation: 6122

Python: NumPy memory error on creating a 3D array. What's the better way to fill a 3D array?

I am making a 3D array of zeros and then filling it, but due to the size of the NumPy array it runs into memory issues even with 64 GB of RAM. Am I doing it wrong?

X_train_one_hot has shape (47827, 30, 20000) and encInput has shape (47827, 30, 200).

import numpy as np

X_train_one_hot_shifted = np.zeros((X_train_one_hot.shape[0], 30, 20200))
# X_train_one_hot.shape[0] = 47827
for j in range(X_train_one_hot.shape[0]):
    # Row 0 gets a start token; rows 1-29 are sample j shifted down by one.
    current = np.zeros((30, 20000))
    current[0][0] = 1
    current[1:] = X_train_one_hot[j][0:29]
    # print(current.shape, encInput[j].shape)
    combined = np.concatenate((current, encInput[j]), axis=1)
    X_train_one_hot_shifted[j] = combined

Any ideas to reduce memory consumption? Another interesting thing: X_train_one_hot has almost the same shape, yet creating it does not throw any error.

EDIT: The program gets killed inside the for loop with the error message:

TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.

Also, most of the array is sparse, since X_train_one_hot is a one-hot encoding of size 20000.

Upvotes: 2

Views: 1114

Answers (2)

lightalchemist

Reputation: 10221

Imtinan Azhar is correct. You simply do not have enough RAM to hold the array.

You have a few options.

1) You seem to have a very sparse matrix even though its size is large, so you can try one of the sparse matrix representations from SciPy (a sketch follows below).

If you are feeding the array into a library package such as Scikit-Learn or one of the deep learning libraries, this will likely not work.
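For example, a minimal sketch of this idea, assuming you keep one 2D sparse matrix per sample (SciPy's sparse types are 2D only, so a 3D array becomes a list of per-sample matrices; the toy sizes below stand in for the question's shapes):

import numpy as np
from scipy import sparse

# Toy sizes standing in for the question's (47827, 30, 20200).
n_samples, n_steps, n_features = 100, 30, 20200

# SciPy sparse matrices are 2D, so keep one CSR matrix per sample.
shifted_sparse = []
for j in range(n_samples):
    current = np.zeros((n_steps, n_features))
    current[0, 0] = 1  # ... fill the rest as in the question ...
    shifted_sparse.append(sparse.csr_matrix(current))

# CSR stores only the nonzero entries, so memory scales with the
# number of ones per sample instead of 30 * 20200 floats.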

2) Most DL libraries don't require you to load all your data at once. You can prepare your data in batches: create this matrix batch by batch and save each batch to file (preferably in a sparse representation), then use a data generator to feed your algorithm, or manually load batches of your data.
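A minimal generator sketch along these lines, reusing the question's construction (batch_size is an illustrative assumption, not from the original answer):

import numpy as np

def batch_generator(X_one_hot, enc_input, batch_size=32):
    # Build only one batch of the shifted array at a time, so peak memory
    # is batch_size * 30 * 20200 floats instead of the full array.
    n = X_one_hot.shape[0]
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        block = np.zeros((end - start, 30, 20200))
        for j in range(start, end):
            current = np.zeros((30, 20000))
            current[0, 0] = 1
            current[1:] = X_one_hot[j][:29]
            block[j - start] = np.concatenate((current, enc_input[j]), axis=1)
        yield block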

3) If none of these is possible, you can try to memory-map the array using NumPy's np.memmap; the NumPy documentation has further examples.
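A minimal memmap sketch ('X_shifted.dat' is a placeholder path):

import numpy as np

shape = (47827, 30, 20200)

# The array lives in a file on disk, not in RAM; mode='w+' creates it
# and the OS pages slices in and out on demand.
X_shifted = np.memmap('X_shifted.dat', dtype=np.float64, mode='w+', shape=shape)

for j in range(shape[0]):
    # ... assign the per-sample `combined` block as in the question ...
    pass

X_shifted.flush()  # push buffered writes out to disk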

4) Another option is to use dask and pull out slices of the data only when necessary.
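A minimal dask sketch; the chunk size of 1000 samples is an illustrative assumption:

import dask.array as da

# Chunk along the sample axis; dask materializes chunks lazily.
x = da.zeros((47827, 30, 20200), chunks=(1000, 30, 20200))

# Only the requested slice is computed and brought into memory.
block = x[:100].compute()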

Personally, I would go with option 2, or option 1 if the algorithms consuming the matrix can handle (or be modified to handle) sparse matrices.

Upvotes: 3

Imtinan Azhar

Reputation: 1753

Let's see: your X_train_one_hot_shifted has shape (47827, 30, 20200), which is 47827 * 30 * 20200 = 28,983,162,000 floats.

28,983,162,000 * 8 gives the memory consumption of this array in bytes (NumPy's default float64 is 8 bytes per element), which is 231,865,296,000 bytes.

Let's simplify this:

231,865,296,000 B
226,430,953.125 KB
221,123.977661 MB
215.941384435 GB
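A quick way to reproduce that estimate in Python (an illustrative check, not part of the original answer):

# 47827 * 30 * 20200 elements, 8 bytes each (float64)
n_bytes = 47827 * 30 * 20200 * 8
print(n_bytes / 1024**3)  # -> 215.94...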

You would need about 216 GB of RAM to fit X_train_one_hot_shifted in memory. I suspect the shape 20200 is a typo; double-check it.

Upvotes: 1
