ketzul

Reputation: 25

Python: How to split np array / list into two arrays / lists without using more RAM

I have a problem splitting an np.array and a list into two. Here is my code:

X = []
y = []
for seq, target in ConvertedData:
    X.append(seq)
    y.append(target)

y = np.vstack(y)

train_x = np.array(X)[:int(len(X) * 0.9)]
train_y = y[:int(len(X) * 0.9)]
validation_x = np.array(X)[int(len(X) * 0.9):]
validation_y = y[int(len(X) * 0.9):]

This is a sample of the code that prepares data for a neural network. It works, but raises a MemoryError (I have 32 GB on board):

Traceback (most recent call last):
  File "D:/Projects/....Here is a file location.../FileName.py", line 120, in <module>
    validation_x = np.array(X)[int(len(X) * 0.9):]
MemoryError

It seems that the list X and the np.array y are kept in memory and duplicated as the separate variables train_x, train_y, validation_x, validation_y. Do you know how to deal with this?

Shape of X: (324000, 256, 24)

Shape of y: (324000, 10)

Shape of train_x: (291600, 256, 24)

Shape of train_y: (291600, 10)

Shape of validation_x: (32400, 256, 24)

Shape of validation_y: (32400, 10)

Upvotes: 1

Views: 820

Answers (2)

hpaulj

Reputation: 231500

X = []
y = []
for seq, target in ConvertedData:
    X.append(seq)
    y.append(target)

X is a list of seq items; I assume those are arrays. X just holds pointers to them.

y = np.vstack(y)

train_x = np.array(X)[:int(len(X) * 0.9)]

This makes an array from X, and then takes a slice of that array. Because the slice is a view, the full np.array(X) still exists in memory.
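
To see why that matters at these shapes, here is a back-of-the-envelope size check, assuming float64 data (the question doesn't state the dtype):

# One dense copy of X holds 324000 * 256 * 24 float64 values, 8 bytes each.
bytes_per_copy = 324000 * 256 * 24 * 8
print(bytes_per_copy / 2**30)  # ~14.8 GiB per full copy of X

Two such copies (one per np.array(X) call), plus the original arrays still referenced by the list, can easily exceed 32 GB.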

train_y = y[:int(len(X) * 0.9)]
validation_x = np.array(X)[int(len(X) * 0.9):]

This makes another full array from X, so train_x and validation_x end up as views of two separate full-size arrays.

validation_y = y[int(len(X) * 0.9):]

Doing

X1 = np.array(X)
train_x = X1[:...]
validation_x = X1[...:]

will eliminate that duplication. Both are views of the same X1.

Another approach would be to slice the list first:

train_x = np.array(X[:...])
validation_x = np.array(X[...:])

My guess is that memory use, at least within the arrays, will be similar.

del X after creating X1 might also help, allowing X and the arrays it references to be garbage collected.
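
Putting those pieces together, a minimal sketch with the 90/10 split index from the question filled in:

X1 = np.array(X)           # one full copy of the data
del X                      # let the list and the arrays it references be collected
split = int(len(X1) * 0.9)
train_x = X1[:split]       # view into X1, no extra memory
validation_x = X1[split:]  # view into X1, no extra memory
train_y = y[:split]
validation_y = y[split:]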

But beware: once you start hitting a memory error at one point in your code, tricks like this may only postpone it. Calculations can easily end up making copies, or temporary buffers, of comparable size.


With X1, your split uses two slices; those produce views, which don't add to the original memory use. But if you make a shuffled split (fancy indexing), the train and validation parts will be copies, and together will take up as much memory as the source.
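
A small demonstration of the view-versus-copy distinction, using np.shares_memory to check:

import numpy as np

a = np.arange(10)
sliced = a[:9]                        # basic slicing -> view
print(np.shares_memory(a, sliced))    # True

idx = np.random.permutation(len(a))
shuffled = a[idx]                     # fancy indexing -> copy
print(np.shares_memory(a, shuffled))  # False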

Upvotes: 1

ElConrado

Reputation: 1640

As described in this answer about memory errors, you can pickle each array of training data to a file, as in this question.

You can split with train_test_split; it could be a more efficient way of performing the split.

import numpy as np
from sklearn.model_selection import train_test_split

# Example from the scikit-learn docs: split 5 samples into train and test sets.
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
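
Note that train_test_split returns copies rather than views, so on its own it won't reduce peak memory; it is mainly a convenience. A hypothetical application to the question's data (shuffle=False mimics the original ordered 90/10 split):

# Assumes X and y are the arrays from the question.
train_x, validation_x, train_y, validation_y = train_test_split(
    X, y, test_size=0.1, shuffle=False)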

Upvotes: 0
