sha_hla

Reputation: 344

Shuffling after splitting the data, or before?

I want to shuffle my dataset. In some code I found on GitHub, the shuffle happens after the data is split. What is the difference between shuffling after splitting and not shuffling at all? Which approach is correct? I think the shuffle should happen before the split.

    if split == 'train':
        images = train_images[:50000]
        labels = train_labels[:50000]
    elif split == 'val':
        images = train_images[50000:60000]
        labels = train_labels[50000:60000]
    elif split == 'test':
        images = test_images
        labels = test_labels

    if randomize:
        rng_state = np.random.get_state()
        np.random.shuffle(images)
        np.random.set_state(rng_state)
        np.random.shuffle(labels)

or

    if randomize:
        rng_state = np.random.get_state()
        np.random.shuffle(images)
        np.random.set_state(rng_state)
        np.random.shuffle(labels)

    if split == 'train':
        images = train_images[:50000]
        labels = train_labels[:50000]
    elif split == 'val':
        images = train_images[50000:60000]
        labels = train_labels[50000:60000]
    elif split == 'test':
        images = test_images
        labels = test_labels

Upvotes: 1

Views: 1948

Answers (1)

Nikaido

Reputation: 4629

It only makes sense to shuffle the dataset before the split.

If you shuffle after the split, the contents of each set stay the same; you only change the order of the instances.

Basically, if you shuffle before the split, you get different training / validation / test sets each time.

If you shuffle after the split, you always get the same sets.


Example:

1) Shuffling before the split

    from random import shuffle

    my_set = ["A", "B", "C", "D", "E", "F", "G"]
    shuffle(my_set)  # shuffles the list in place
    # e.g. ["B", "A", "D", "E", "C", "G", "F"]
    train = my_set[:3]   # ["B", "A", "D"]
    val = my_set[3:5]    # ["E", "C"]
    test = my_set[5:]    # ["G", "F"]

2) Shuffling after the split

    from random import sample

    my_set = ["A", "B", "C", "D", "E", "F", "G"]
    train = my_set[:3]   # ["A", "B", "C"]
    val = my_set[3:5]    # ["D", "E"]
    test = my_set[5:]    # ["F", "G"]
    # sample(x, len(x)) returns a shuffled copy and leaves x unchanged
    new_train = sample(train, len(train))
    new_val = sample(val, len(val))
    new_test = sample(test, len(test))
    set(new_train) == set(train)  # True
    set(new_val) == set(val)      # True
    set(new_test) == set(test)    # True
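
For NumPy arrays like the ones in the question, one way to shuffle before splitting while keeping images and labels aligned is to index both arrays with a single permutation. This is only a minimal sketch, not the code from the GitHub repository: the function name `shuffle_then_split`, the seed, and the toy array sizes are assumptions made for illustration.

```python
import numpy as np

def shuffle_then_split(images, labels, n_train, seed=0):
    """Shuffle images and labels with one shared permutation, then split."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(images))  # one permutation used for both arrays
    images, labels = images[perm], labels[perm]
    train = (images[:n_train], labels[:n_train])
    val = (images[n_train:], labels[n_train:])
    return train, val

# Toy data: 10 "images" with 2 features each; label i belongs to image i.
images = np.arange(20).reshape(10, 2)
labels = np.arange(10)
(train_x, train_y), (val_x, val_y) = shuffle_then_split(images, labels, n_train=7)
```

Because the same `perm` indexes both arrays, each label stays attached to its image, and changing the seed changes which samples land in the training and validation sets.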

NOTE: During training, the order of the samples can affect performance, for example with algorithms that process the data in batches and use gradient-based optimization.
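
This is why shuffling the training set on its own can still be worthwhile, even though it does not change which samples the set contains. A rough sketch of reshuffling the training order at every epoch is shown here; the helper `iterate_minibatches`, the batch size, and the toy data are hypothetical and not taken from the question's code.

```python
import numpy as np

def iterate_minibatches(x, y, batch_size, rng):
    """Yield (x, y) mini-batches in a freshly shuffled order."""
    perm = rng.permutation(len(x))  # new order on every call
    for start in range(0, len(x), batch_size):
        idx = perm[start:start + batch_size]
        yield x[idx], y[idx]

rng = np.random.default_rng(0)
x = np.arange(12).reshape(6, 2)  # 6 toy samples; label i belongs to sample i
y = np.arange(6)

for epoch in range(2):
    # Same training samples every epoch, visited in a different order.
    batches = list(iterate_minibatches(x, y, batch_size=4, rng=rng))
    seen = np.concatenate([by for _, by in batches])
```

Each epoch still visits every training sample exactly once; only the composition of the individual batches changes.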

Upvotes: 3
