Reputation: 344
I want to shuffle my dataset. In some code I found on GitHub, the shuffle happens after the data is split. What is the difference between shuffling after splitting and not shuffling at all? Which approach is correct? I think the shuffle should happen before splitting.
if split == 'train':
    images = train_images[:50000]
    labels = train_labels[:50000]
elif split == 'val':
    images = train_images[50000:60000]
    labels = train_labels[50000:60000]
elif split == 'test':
    images = test_images
    labels = test_labels
if randomize:
    rng_state = np.random.get_state()
    np.random.shuffle(images)
    np.random.set_state(rng_state)
    np.random.shuffle(labels)
or
if randomize:
    rng_state = np.random.get_state()
    np.random.shuffle(images)
    np.random.set_state(rng_state)
    np.random.shuffle(labels)
if split == 'train':
    images = train_images[:50000]
    labels = train_labels[:50000]
elif split == 'val':
    images = train_images[50000:60000]
    labels = train_labels[50000:60000]
elif split == 'test':
    images = test_images
    labels = test_labels
Upvotes: 1
Views: 1948
Reputation: 4629
It only makes sense to shuffle the dataset before the split.
If you shuffle after the split, the shuffle does not change which instances end up in each set; it only changes the order of the instances within each set.
Basically, if you shuffle before the split, you get different training / validation / test sets each time.
If you shuffle after the split, you always get the same sets.
Example:
1) shuffling before split
from random import shuffle
my_set = ["A", "B", "C", "D", "E", "F", "G"]
shuffle(my_set)  # shuffles the list in place
# e.g. ["B", "A", "D", "E", "C", "G", "F"]
train = my_set[:3] # ["B", "A", "D"]
val = my_set[3:5] # ["E", "C"]
test = my_set[5:] # ["G", "F"]
2) shuffling after split
from random import sample
my_set = ["A", "B", "C", "D", "E", "F", "G"]
train = my_set[:3] # ["A", "B", "C"]
val = my_set[3:5] # ["D", "E"]
test = my_set[5:] # ["F", "G"]
# sample(x, len(x)) returns a shuffled copy and leaves x unchanged
new_train = sample(train, len(train))
new_val = sample(val, len(val))
new_test = sample(test, len(test))
set(new_train) == set(train) # True
set(new_val) == set(val) # True
set(new_test) == set(test) # True
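Applied to NumPy arrays like the ones in the question, shuffling before the split could look roughly like the sketch below. It uses a single shared permutation so that images and labels stay aligned. The helper name `shuffle_then_split`, the `seed` parameter, and the toy arrays are assumptions made for illustration; they are not from the original code.

```python
import numpy as np

def shuffle_then_split(images, labels, n_train, seed=0):
    """Shuffle images and labels with one shared permutation, then split."""
    assert len(images) == len(labels)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(images))          # one index order for both arrays
    images, labels = images[perm], labels[perm]  # fancy indexing returns copies
    train = (images[:n_train], labels[:n_train])
    val = (images[n_train:], labels[n_train:])
    return train, val

# Toy stand-ins for train_images / train_labels: image row i holds the value 10 * i.
toy_images = np.arange(10).reshape(10, 1) * 10
toy_labels = np.arange(10)
(train_x, train_y), (val_x, val_y) = shuffle_then_split(toy_images, toy_labels, n_train=8)
```

Because both arrays are indexed with the same `perm`, each label still belongs to its image after the shuffle, and changing `seed` changes which samples land in the training and validation sets.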
NOTE: The order of the training set can still affect performance during training, for example with algorithms that process the data in batches and rely on derivative-based (gradient) updates.
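For batch-based training, one common way to handle this is to reshuffle the training set at the start of every epoch. The following is a minimal sketch of that idea, assuming NumPy arrays; the generator name `iterate_minibatches` and the toy data are hypothetical and only serve as an illustration.

```python
import numpy as np

def iterate_minibatches(images, labels, batch_size, rng):
    """Yield (images, labels) batches in a fresh random order on each call."""
    perm = rng.permutation(len(images))
    for start in range(0, len(perm), batch_size):
        idx = perm[start:start + batch_size]
        yield images[idx], labels[idx]

rng = np.random.default_rng(0)
x = np.arange(6).reshape(6, 1)  # toy "images": row i holds the value i
y = np.arange(6)                # toy labels: label i belongs to row i

# Record the label order seen in each of two epochs.
epoch_orders = []
for epoch in range(2):
    seen = [batch_y for _, batch_y in iterate_minibatches(x, y, batch_size=4, rng=rng)]
    epoch_orders.append(np.concatenate(seen))
```

Every epoch still visits each training sample exactly once; only the order, and therefore the composition of each batch, changes.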
Upvotes: 3