Why does shuffling the data like this lead to poor accuracy?

Question

I am reading the book "Hands-On Machine Learning" and I have a problem with exercise 9 of chapter 13. The exercise is as follows:

a. Load the Fashion MNIST dataset (introduced in Chapter 10); split it into a training set, a validation set, and a test set; shuffle the training set; and save each dataset to multiple TFRecord files. Each record should be a serialized Example protobuf with two features: the serialized image (use tf.io.serialize_tensor() to serialize each image), and the label.

b. Then use tf.data to create an efficient dataset for each set. Finally, use a Keras model to train these datasets, including a preprocessing layer to standardize each input feature.

You can find the exercise with the solution at the end of this notebook: https://github.com/ageron/handson-ml2/blob/master/13_loading_and_preprocessing_data.ipynb

I loaded the data like this:

fashion_mnist = keras.datasets.fashion_mnist
(X_full_train, y_full_train), (X_test, y_test) = fashion_mnist.load_data()
X_train, X_valid = X_full_train[:55000], X_full_train[55000:]
y_train, y_valid = y_full_train[:55000], y_full_train[55000:]

# Why does shuffling the data like this lead to very poor accuracy?
X_train = shuffle(X_train)

X_train = X_train.reshape(55000, 784)
X_valid = X_valid.reshape(5000, 784)
X_test = X_test.reshape(10000, 784)

train_data = np.c_[X_train, y_train]
valid_data = np.c_[X_valid, y_valid]
test_data = np.c_[X_test, y_test]

def save_to_multiple_tfrecord_files(data, name_prefix, n_parts):
    # this function saves a dataset to multiple TFRecord files using the Example 
    # protobuf and returns the file paths (please use the link below to see the full code)

After that, I loaded the data using this function:

def tfrecord_reader_mnist(filepaths, ..., batch_size=32):
    # ...
    # load the data, preprocess it, and shuffle it (use the link below for the full code)
    # ...
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(1)

After that, I standardized the data using a custom Standardization layer and created the model:

class Standardization(keras.layers.Layer):
    def adapt(self, data_sample):
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)
    def call(self, inputs):
        return (inputs - self.means_) / (self.stds_ + keras.backend.epsilon())

# ...

model = keras.models.Sequential([
    standardization,
    keras.layers.Flatten(),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="nadam", metrics=["accuracy"])

model.fit(train_set, epochs=5, validation_data=valid_set,
          callbacks=[tensorboard_cb])

When I ran the fit method, I got this accuracy:

Epoch 1/5
1719/1719 [==============================] - 7s 4ms/step - loss: 600.4416 - accuracy: 0.1032 - val_loss: 148.7675 - val_accuracy: 0.1236
# ...
Epoch 5/5
1719/1719 [==============================] - 7s 4ms/step - loss: 187.1646 - accuracy: 0.1338 - val_loss: 58.7837 - val_accuracy: 0.1564

If I just comment out the line where I shuffle the data (this one: X_train = shuffle(X_train)), I get this accuracy:

Epoch 1/5
1719/1719 [==============================] - 7s 4ms/step - loss: 0.4473 - accuracy: 0.8424 - val_loss: 0.3331 - val_accuracy: 0.8796
# ...
Epoch 5/5
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2464 - accuracy: 0.9089 - val_loss: 0.2185 - val_accuracy: 0.9179

The book's author used the shuffle function from TensorFlow (I used the one from sklearn).

Can anyone tell me why this happens?

If you want to test my code, please use this notebook: https://colab.research.google.com/drive/1T6OJqwEeCIUJyAWwmIGoSaF86gUK9wPL?usp=sharing

I also copied the author's solution from the book into a notebook so you can test it straight away: https://colab.research.google.com/drive/1z5a8SJg0tzeV5s7MTOY_uU-dCxSC8tjv

janluke · Accepted Answer

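Because you are shuffling only the input data (X_train) without applying the same shuffling to the corresponding labels (y_train). After that call, each image is paired with an essentially arbitrary label, which is why the accuracy stays close to 10%, the chance level for 10 classes. You should shuffle both together: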

X_train, y_train = shuffle(X_train, y_train)
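
To see the effect in isolation, here is a minimal sketch using only NumPy and a toy dataset in place of Fashion MNIST. The toy data, the helper aligned_fraction, and the variable names are assumptions made for illustration and are not part of the question's notebook. Each fake "image" stores its own label in its pixels, so it is easy to check whether image/label pairs survive the shuffle.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for Fashion MNIST (an assumption for illustration):
# every "image" encodes its own label, so pairing can be verified later.
n_samples, n_classes = 10_000, 10
y = rng.integers(0, n_classes, size=n_samples)
X = np.repeat(y[:, None], 4, axis=1).astype(np.float32)  # 4 "pixels" per sample


def aligned_fraction(X, y):
    """Fraction of rows whose encoded label still matches the label array."""
    return float(np.mean(X[:, 0] == y))


# Shuffling X alone (what the question does) breaks the image/label pairing.
X_only = X[rng.permutation(n_samples)]
broken = aligned_fraction(X_only, y)

# Applying one shared permutation to both arrays keeps the pairing intact.
perm = rng.permutation(n_samples)
X_shuf, y_shuf = X[perm], y[perm]
intact = aligned_fraction(X_shuf, y_shuf)

print(f"X shuffled alone:          {broken:.3f} of labels still match")
print(f"X and y shuffled together: {intact:.3f} of labels still match")
```

When only X is shuffled, roughly one label in ten still matches purely by chance, which lines up with the near-10% accuracy in the question. Indexing both arrays with the same permutation is the manual equivalent of passing both arrays to a single shuffle call, as in the line above.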
