user1879926

Reputation: 1323

Memory efficient way to split large numpy array into train and test

I have a large numpy array and when I run scikit learn's train_test_split to split the array into training and test data, I always run into memory errors. What would be a more memory efficient method of splitting into train and test, and why does the train_test_split cause this?

The following code results in a memory error and causes a crash:

import numpy as np
from sklearn.cross_validation import train_test_split

X = np.random.random((10000,70000))
Y = np.random.random((10000,))
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state=42)

Upvotes: 17

Views: 18720

Answers (5)

sisem

Reputation: 61

Another way to use the sklearn split method with reduced memory usage is to generate an index vector for X and split on that vector. Afterwards you can select your entries and, for example, write the training and test splits to disk.

import h5py
import numpy as np
from sklearn.cross_validation import train_test_split

X = np.random.random((10000,70000))
Y = np.random.random((10000,))

x_ids = list(range(len(X)))
x_train_ids, x_test_ids, Y_train, Y_test = train_test_split(x_ids, Y, test_size = 0.33, random_state=42)

# Write

f = h5py.File('dataset/train.h5py', 'w')
f.create_dataset("inputs", data=X[x_train_ids])   # keep the float dtype; casting to int would destroy the data
f.create_dataset("labels", data=Y_train)
f.close()

f = h5py.File('dataset/test.h5py', 'w')
f.create_dataset("inputs", data=X[x_test_ids])
f.create_dataset("labels", data=Y_test)
f.close()

# Read

f = h5py.File('dataset/train.h5py', 'r')
X_train = np.array(f.get('inputs'))
Y_train = np.array(f.get('labels'))
f.close()

f = h5py.File('dataset/test.h5py', 'r')
X_test = np.array(f.get('inputs'))
Y_test = np.array(f.get('labels'))
f.close()

Upvotes: 6

user1879926

Reputation: 1323

One method that I've tried and that works is to store X in a pandas DataFrame and shuffle it:

X = X.reindex(np.random.permutation(X.index))

since I arrive at the same memory error when I try

np.random.shuffle(X)

Then I convert the pandas DataFrame back to a numpy array, and using this function I can obtain a train/test split:

# test_proportion of 3 means 1/3, so 33% test and 67% train
def shuffle(matrix, target, test_proportion):
    ratio = int(matrix.shape[0] / test_proportion)  # should be int
    X_train = matrix[ratio:, :]
    X_test = matrix[:ratio, :]
    Y_train = target[ratio:]   # target is 1-D, so index with a single slice
    Y_test = target[:ratio]
    return X_train, X_test, Y_train, Y_test

X_train, X_test, Y_train, Y_test = shuffle(X, Y, 3)

This works for now, and when I want to do k-fold cross-validation, I can iteratively loop k times and shuffle the pandas DataFrame. While this suffices for now, why do numpy's and scikit-learn's implementations of shuffle and train_test_split result in memory errors for big arrays?
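A minimal sketch of that repeated-shuffle idea, assuming X is held in a pandas DataFrame and reusing the shuffle() helper above; reindexing Y with the same permutation is an added assumption so the labels stay aligned:

import numpy as np
import pandas as pd

X_df, Y_s = pd.DataFrame(X), pd.Series(Y)

k = 3
for fold in range(k):
    # reshuffle the rows each iteration and re-split, as described above;
    # Y is reindexed with the same order so labels stay aligned with X
    perm = np.random.permutation(X_df.index)
    X_df, Y_s = X_df.reindex(perm), Y_s.reindex(perm)
    X_train, X_test, Y_train, Y_test = shuffle(X_df.values, Y_s.values, 3)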

Upvotes: 10

tabata

Reputation: 469

I came across a similar problem.

As mentioned by @user1879926, I think the shuffle is the main cause of the memory exhaustion.

And, although 'shuffle' has been claimed to be an invalid parameter for model_selection.train_test_split (as cited elsewhere), train_test_split in sklearn 0.19 does have an option to disable shuffling.

So I think you can escape the memory error just by adding the shuffle=False option.
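For reference, a minimal sketch of that suggestion, assuming sklearn >= 0.19, where train_test_split lives in sklearn.model_selection:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.random((10000, 70000))
Y = np.random.random((10000,))

# shuffle=False skips the random permutation step, which this answer identifies
# as the source of the memory blow-up; note the split is then not randomised
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, shuffle=False)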

Upvotes: 5

dhanush-ai1990

Reputation: 335

I faced the same problem with my code. I was using a dense array like yours and ran out of memory. I converted my training data to a sparse format (I am doing document classification) and that solved my issue.
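A minimal sketch of that idea, assuming the features are mostly zeros (e.g. bag-of-words counts) and that a recent sklearn is available:

import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split

# CSR only stores the non-zero entries, so this only saves memory
# if X really is sparse (it would not help with the dense random X above)
X_sparse = sparse.csr_matrix(X)
X_train, X_test, Y_train, Y_test = train_test_split(X_sparse, Y, test_size=0.33, random_state=42)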

Upvotes: 1

DMML

Reputation: 1452

I suppose a more "memory efficient" way would be to iteratively select instances for training and testing (although, as is typical in computer science, you sacrifice the efficiency inherent in using matrices).

What you could do is iterate over the array and, for each instance, 'flip a coin' (using the random package) to determine whether it goes into training or testing, and then, depending on the outcome, store the instance in the appropriate numpy array.
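A minimal sketch of that coin-flip approach; collecting row indices and slicing at the end, rather than copying rows one by one, is an assumption on my part:

import random

test_fraction = 0.33
train_idx, test_idx = [], []
for i in range(X.shape[0]):
    # 'flip a coin' for each instance
    if random.random() < test_fraction:
        test_idx.append(i)
    else:
        train_idx.append(i)

X_train, Y_train = X[train_idx], Y[train_idx]
X_test, Y_test = X[test_idx], Y[test_idx]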

This iterative method shouldn't be bad for only 10000 instances. What is curious, though, is that 10000 x 70000 isn't all that large; what type of machine are you running? It makes me wonder whether this is a Python/numpy/scikit-learn issue or a machine issue...

Anyway, hope that helps!

Upvotes: -1
