Reputation: 25
I have a problem with splitting a np.array and a list into two. Here is my code:
import numpy as np

X = []
y = []
for seq, target in ConvertedData:
    X.append(seq)
    y.append(target)

y = np.vstack(y)
train_x = np.array(X)[:int(len(X) * 0.9)]
train_y = y[:int(len(X) * 0.9)]
validation_x = np.array(X)[int(len(X) * 0.9):]
validation_y = y[int(len(X) * 0.9):]
This is a sample of the code that prepares data for a neural network. It works fine, but raises an out-of-memory error (I have 32 GB on board):
Traceback (most recent call last):
File "D:/Projects/....Here is a file location.../FileName.py", line 120, in <module>
validation_x = np.array(X)[int(len(X) * 0.9):]
MemoryError
It seems like it keeps the list X and the array y in memory and duplicates them as the separate variables train_x, train_y, validation_x and validation_y. Do you know how to deal with this?
Shape of X: (324000, 256, 24)
Shape of y: (324000, 10)
Shape of train_x: (291600, 256, 24)
Shape of train_y: (291600, 10)
Shape of validation_x: (32400, 256, 24)
Shape of validation_y: (32400, 10)
Upvotes: 1
Views: 820
Reputation: 231500
X = []
y = []
for seq, target in ConvertedData:
    X.append(seq)
    y.append(target)
X is a list of seq. I assume those are arrays; X just has pointers to those.
y = np.vstack(y)
train_x = np.array(X)[:int(len(X) * 0.9)]

Makes an array from X, and then a slice of that array. The full np.array(X) still exists in memory.
train_y = y[:int(len(X) * 0.9)]
validation_x = np.array(X)[int(len(X) * 0.9):]

Makes another array from X. train_x and validation_x are views of separate full-size arrays.

validation_y = y[int(len(X) * 0.9):]
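You can verify this view behavior with np.shares_memory; here is a small check with toy shapes (the names full, split and train are illustrative, not from the question):

import numpy as np

full = np.zeros((10, 4))              # toy stand-in for np.array(X)
split = int(len(full) * 0.9)
train = full[:split]                  # basic slice: a view, no copy
print(np.shares_memory(train, full))  # True
print(train.base is full)             # True: the slice keeps full alive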
Doing

X1 = np.array(X)
train_x = X1[:...]
validation_x = X1[...:]

will eliminate that duplication. Both are views of the same X1.
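Filled in with the question's 90/10 split, that looks like this (X1 and split are just illustrative names):

X1 = np.array(X)          # build the big array once
split = int(len(X1) * 0.9)
train_x = X1[:split]      # view into X1, no copy
validation_x = X1[split:] # view into X1, no copy
train_y = y[:split]
validation_y = y[split:]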
Another approach would be to slice the list first:
train_x = np.array(X[:...])
validation_x = np.array(X[...:])
My guess is that memory use, at least for the arrays, will be similar.
del X after creating X1 might also help, allowing X and the arrays it references to be garbage collected.
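For example (same illustrative names; this only helps if nothing else, such as ConvertedData, still references those arrays):

X1 = np.array(X)           # data is copied into one contiguous block
del X                      # the list, and its arrays if unreferenced, can now be collected
train_x = X1[:split]
validation_x = X1[split:]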
But beware: once you start hitting a memory error at one point in your code, tricks like this might only postpone it. Calculations can easily end up making copies, or temporary buffers, of comparable size.
Your split uses 2 slices; those result in views, which don't add to the original memory use. But if you make a shuffled split, the train and validation parts will be copies, and together will take up as much memory as the source.
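To illustrate the shuffled case (a toy example; advanced indexing with an index array always copies):

import numpy as np

full = np.zeros((10, 4))
idx = np.random.permutation(len(full))  # shuffled sample order
train = full[idx[:9]]                   # advanced indexing: always copies
print(np.shares_memory(train, full))    # False: train is new memory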
Upvotes: 1
Reputation: 1640
As described in this answer about memory errors, you can pickle each array of training data to a file, as in this question.
You can also split with train_test_split; it could be a more convenient way of performing the split.
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: X has 5 samples of 2 features, y has 5 labels
X, y = np.arange(10).reshape((5, 2)), range(5)
# Hold out 33% of the samples for testing, with a fixed seed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
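Applied to the data in the question, it might look like this (a sketch reusing the question's names; note that a shuffled split produces copies, as the other answer warns):

# 90/10 split of the (324000, 256, 24) data and (324000, 10) labels
train_x, validation_x, train_y, validation_y = train_test_split(
    np.array(X), y, test_size=0.1, random_state=42)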
Upvotes: 0