Reputation: 105
I have a numpy array "my_data". I am trying to split this dataset randomly. However, when I do this using the following code, I get a "train" array and a "test" array that have some rows in common.
training_idx = np.random.randint(my_data.shape[0], size=split_size)
test_idx = np.random.randint(my_data.shape[0], size=len(my_data)-split_size)
train, test = my_data[training_idx,:], my_data[test_idx,:]
My intention is to pick the train rows randomly first, and then have every row of my_data that is not in the train array become part of the test array.
Is there a way to do this in numpy? (I am refraining from using sklearn to split my data.)
I referred to this post to get this far with my dataset: How to split/partition a dataset into training and test datasets for, e.g., cross validation?
If I code per that post's logic, I end up with train and test datasets that share some redundant rows. I intend to make train and test datasets where no rows are common.
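(For context, the overlap comes from np.random.randint drawing indices with replacement, so the same row index can land in both draws. A minimal sketch of the problem, using a seeded RandomState and toy data as assumptions for reproducibility:)

```python
import numpy as np

rng = np.random.RandomState(0)        # seeded so the duplicates are reproducible
my_data = np.arange(20).reshape(10, 2)  # toy data: 10 rows
split_size = 7

# randint samples WITH replacement, so indices can repeat
training_idx = rng.randint(my_data.shape[0], size=split_size)
test_idx = rng.randint(my_data.shape[0], size=len(my_data) - split_size)

# row indices drawn in both arrays end up in both train and test
overlap = np.intersect1d(training_idx, test_idx)
print(overlap)
```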
Upvotes: 2
Views: 193
Reputation: 1312
Following this answer you can do:
train_idx = np.random.choice(my_data.shape[0], size=split_size, replace=False)  # sample rows without replacement
mask = np.ones(my_data.shape[0], dtype=bool)  # 1-D mask over rows (ones_like(my_data) would make a 2-D mask and flatten the result)
mask[train_idx] = False
train, test = my_data[~mask], my_data[mask]
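(A quick self-contained check of the mask approach; the toy data, seed, and split size are assumptions for the demo. Sampling indices with replace=False and building a 1-D row mask keeps the two splits disjoint:)

```python
import numpy as np

rng = np.random.RandomState(42)
my_data = np.arange(24).reshape(12, 2)  # toy data: 12 rows
split_size = 8

# draw row indices without replacement so train itself has no duplicate rows
train_idx = rng.choice(my_data.shape[0], size=split_size, replace=False)
mask = np.ones(my_data.shape[0], dtype=bool)  # one boolean per row
mask[train_idx] = False
train, test = my_data[~mask], my_data[mask]

# every row lands in exactly one split
print(train.shape, test.shape)
```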
That said, a more natural way would be to slice a permutation of your data, as Poojan suggested.
permuted = np.random.permutation(my_data)
train, test = permuted[:split_size], permuted[split_size:]
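(A small usage sketch, again with toy data as an assumption. np.random.permutation returns a shuffled copy of the rows and leaves my_data untouched, so slicing it guarantees the two splits partition the data with no shared rows:)

```python
import numpy as np

my_data = np.arange(20).reshape(10, 2)  # toy data: 10 rows
split_size = 6

permuted = np.random.permutation(my_data)  # shuffled copy; my_data is unchanged
train, test = permuted[:split_size], permuted[split_size:]

# train and test together contain each original row exactly once
print(len(train), len(test))
```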
Upvotes: 1