R_Moose

Reputation: 105

Splitting dataset into two non-redundant numpy arrays?

I have a numpy array "my_data". I am trying to split this dataset randomly. However, when I do this using the following code, I get a "train" array and a "test" array that have some rows in common.

training_idx = np.random.randint(my_data.shape[0], size=split_size)
test_idx = np.random.randint(my_data.shape[0], size=len(my_data)-split_size)
train, test = my_data[training_idx,:], my_data[test_idx,:]

My intention is to pick the train rows randomly first, and then have whatever rows of my_data are not in the train array become the test array.

Is there a way in numpy to do so? (I am refraining from using sklearn to split my data.)

I referred to this post to get this far with my dataset: How to split/partition a dataset into training and test datasets for, e.g., cross validation?

If I follow that post's logic, I end up with train and test datasets that share some rows. I intend to build train and test datasets with no rows in common.
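
For example, with a small array (the shape and split_size below are just for illustration) the overlap is easy to see:

import numpy as np

my_data = np.arange(20).reshape(10, 2)   # 10 rows of toy data
split_size = 7

training_idx = np.random.randint(my_data.shape[0], size=split_size)
test_idx = np.random.randint(my_data.shape[0], size=len(my_data) - split_size)

# randint samples with replacement, so the two index sets will often share values
print(np.intersect1d(training_idx, test_idx))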

Upvotes: 2

Views: 193

Answers (1)

josemz

Reputation: 1312

Following this answer you can do:

train_idx = np.random.randint(my_data.shape[0], size=split_size)
mask = np.ones(my_data.shape[0], dtype=bool)   # one flag per row
mask[train_idx] = False                        # mark the rows drawn for training
train, test = my_data[~mask], my_data[mask]    # train = marked rows, test = the rest
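
One caveat: np.random.randint samples with replacement, so train_idx can contain duplicate indices and train may end up with fewer than split_size rows. If that matters, a minimal sketch using np.random.choice with replace=False (assuming split_size <= len(my_data)) keeps the indices unique:

train_idx = np.random.choice(my_data.shape[0], size=split_size, replace=False)  # unique row indices
mask = np.ones(my_data.shape[0], dtype=bool)
mask[train_idx] = False
train, test = my_data[~mask], my_data[mask]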

A more natural way, though, would be to slice a permutation of your data, as Poojan suggested.

permuted = np.random.permutation(my_data)
train, test = permuted[:split_size], permuted[split_size:]
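
As a quick sanity check (the array below is just an example), the two slices are disjoint and together cover every row exactly once:

my_data = np.arange(20).reshape(10, 2)
split_size = 7

permuted = np.random.permutation(my_data)   # shuffles the rows of a 2D array
train, test = permuted[:split_size], permuted[split_size:]

print(train.shape, test.shape)   # (7, 2) (3, 2)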

Upvotes: 1
