ketzul

Reputation: 25

Python: How to split np array / list into two arrays / lists without using more RAM

I have a problem splitting an np.array and a list into two. Here is my code:

X = []
y = []
for seq, target in ConvertedData:
    X.append(seq)
    y.append(target)

y = np.vstack(y)

train_x = np.array(X)[:int(len(X) * 0.9)]
train_y = y[:int(len(X) * 0.9)]
validation_x = np.array(X)[int(len(X) * 0.9):]
validation_y = y[int(len(X) * 0.9):]

This is a sample of the code that prepares data for a neural network. It works, but raises a MemoryError (I have 32 GB on board):

Traceback (most recent call last):
  File "D:/Projects/....Here is a file location.../FileName.py", line 120, in <module>
    validation_x = np.array(X)[int(len(X) * 0.9):]
MemoryError

It seems that the list X and the np.array y are kept in memory and duplicated as the separate variables train_x, train_y, validation_x, validation_y. Do you know how to deal with this?

Shape of X: (324000, 256, 24)

Shape of y: (324000, 10)

Shape of train_x: (291600, 256, 24)

Shape of train_y: (291600, 10)

Shape of validation_x: (32400, 256, 24)

Shape of validation_y: (32400, 10)

Upvotes: 1

Views: 820

Answers (2)

hpaulj

Reputation: 231500

X = []
y = []
for seq, target in ConvertedData:
    X.append(seq)
    y.append(target)

X is a list of seq items; I assume those are arrays. X just holds pointers to them.

y = np.vstack(y)

train_x = np.array(X)[:int(len(X) * 0.9)]

This makes an array from X, and then takes a slice of that array. Because the slice is a view, the full np.array(X) still exists in memory.
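
To see why that matters at these shapes, here is a back-of-the-envelope size check, assuming float64 data (the question doesn't state the dtype):

# One dense copy of X holds 324000 * 256 * 24 float64 values, 8 bytes each.
bytes_per_copy = 324000 * 256 * 24 * 8
print(bytes_per_copy / 2**30)  # ~14.8 GiB per full copy of X

Two such copies (one per np.array(X) call), plus the original arrays still referenced by the list, can easily exceed 32 GB.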

train_y = y[:int(len(X) * 0.9)]
validation_x = np.array(X)[int(len(X) * 0.9):]

This makes another full array from X, so train_x and validation_x end up as views of two separate full-size arrays.

validation_y = y[int(len(X) * 0.9):]

Doing

X1 = np.array(X)
train_x = X1[:...]
validation_x = X1[...:]

will eliminate that duplication. Both are views of the same X1.

Another approach would be to slice the list first:

train_x = np.array(X[:...])
validation_x = np.array(X[...:])

My guess is that memory use, at least within the arrays, will be similar.

del X after creating X1 might also help, allowing X and the arrays it references to be garbage collected.
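
Putting those pieces together, a minimal sketch with the 90/10 split index from the question filled in:

X1 = np.array(X)           # one full copy of the data
del X                      # let the list and the arrays it references be collected
split = int(len(X1) * 0.9)
train_x = X1[:split]       # view into X1, no extra memory
validation_x = X1[split:]  # view into X1, no extra memory
train_y = y[:split]
validation_y = y[split:]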

But beware: once you start hitting a memory error at one point in your code, tricks like this may only postpone it. Calculations can easily end up making copies, or temporary buffers, of comparable size.


With X1, your split uses two slices; those produce views, which don't add to the original memory use. But if you make a shuffled split (fancy indexing), the train and validation parts will be copies, and together will take up as much memory as the source.
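
A small demonstration of the view-versus-copy distinction, using np.shares_memory to check:

import numpy as np

a = np.arange(10)
sliced = a[:9]                        # basic slicing -> view
print(np.shares_memory(a, sliced))    # True

idx = np.random.permutation(len(a))
shuffled = a[idx]                     # fancy indexing -> copy
print(np.shares_memory(a, shuffled))  # False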

Upvotes: 1

ElConrado

Reputation: 1640

As described in this answer about memory errors, you can pickle each array of training data to a file, as in this question.

You can split with train_test_split; it could be a more efficient way of performing the split.

import numpy as np
from sklearn.model_selection import train_test_split

# Example from the scikit-learn docs: split 5 samples into train and test sets.
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
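
Note that train_test_split returns copies rather than views, so on its own it won't reduce peak memory; it is mainly a convenience. A hypothetical application to the question's data (shuffle=False mimics the original ordered 90/10 split):

# Assumes X and y are the arrays from the question.
train_x, validation_x, train_y, validation_y = train_test_split(
    X, y, test_size=0.1, shuffle=False)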

Upvotes: 0
