MNIST dataset splitting

Can anyone help me split the MNIST dataset into training, validation and testing sets using ratios of my own choosing?

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

I want a 70-20-10 split for training, validation and testing.

Upvotes: 3

Answers (2)

Davide Anghileri

Reputation: 901

Assuming that you do not want to keep the default train/test split provided by the tf.keras.datasets.mnist API, you can concatenate the train and test sets and then split the result into train, validation and test sets based on your ratios.

import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

DATASET_SIZE = 70000
TRAIN_RATIO = 0.7
VALIDATION_RATIO = 0.2
TEST_RATIO = 0.1

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

X = np.concatenate([x_train, x_test])
y = np.concatenate([y_train, y_test])

If you want the datasets to be NumPy arrays, you can use the train_test_split() function from sklearn.model_selection. Here is an example:

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=(1-TRAIN_RATIO))
X_val, X_test, y_val, y_test = train_test_split(X_val, y_val, test_size=((TEST_RATIO/(VALIDATION_RATIO+TEST_RATIO))))
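The second test_size is a fraction of the 30% that was held out, not of the whole dataset, which is why it is TEST_RATIO/(VALIDATION_RATIO+TEST_RATIO) rather than TEST_RATIO. A minimal sketch of that arithmetic in plain Python follows; the helper name split_sizes is hypothetical and not part of sklearn.

```python
def split_sizes(n_samples, train_ratio, val_ratio, test_ratio):
    """Return integer (train, val, test) sizes that sum to n_samples."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    n_train = round(n_samples * train_ratio)
    n_val = round(n_samples * val_ratio)
    # The test set takes whatever remains, so the three sizes always add up.
    n_test = n_samples - n_train - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(70000, 0.7, 0.2, 0.1)

# Fraction of the held-out samples (validation + test) that goes to test.
second_test_size = n_test / (n_val + n_test)

print(n_train, n_val, n_test)      # 49000 14000 7000
print(round(second_test_size, 4))  # 0.3333
```

Because test_size is a float here, rounding can shift the resulting sizes by a sample. If exact counts matter, train_test_split also accepts an integer test_size, which is interpreted as an absolute number of samples.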

If you prefer to use the tf Dataset API then you can use the .take() and .skip() methods as follows:

dataset = tf.data.Dataset.from_tensor_slices((X, y))

train_dataset = dataset.take(int(TRAIN_RATIO*DATASET_SIZE))
validation_dataset = dataset.skip(int(TRAIN_RATIO*DATASET_SIZE)).take(int(VALIDATION_RATIO*DATASET_SIZE))
test_dataset = dataset.skip(int(TRAIN_RATIO*DATASET_SIZE)).skip(int(VALIDATION_RATIO*DATASET_SIZE))

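Furthermore, you can call .shuffle() on the dataset before the split (that is, before the .take() and .skip() calls) to generate shuffled partitions. It needs a buffer size; using the full dataset size gives a uniform shuffle. Setting reshuffle_each_iteration=False keeps the order fixed across epochs, so samples do not move between the train, validation and test partitions:

dataset = dataset.shuffle(DATASET_SIZE, reshuffle_each_iteration=False)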
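The idea behind shuffling before splitting can be sketched without TensorFlow: permute the sample indices once, then cut the permutation into contiguous ranges, which is what .take() and .skip() do on the shuffled dataset. This is only an illustration under assumptions: the helper name shuffled_partitions and the seed value 42 are hypothetical, and it uses the Python standard library rather than the tf.data API.

```python
import random


def shuffled_partitions(n_samples, train_ratio, val_ratio, seed=42):
    """Shuffle indices once, then slice them into train/val/test ranges."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # one reproducible shuffle

    n_train = round(train_ratio * n_samples)
    n_val = round(val_ratio * n_samples)

    train_idx = indices[:n_train]               # like take(n_train)
    val_idx = indices[n_train:n_train + n_val]  # like skip(n_train).take(n_val)
    test_idx = indices[n_train + n_val:]        # like skip(n_train + n_val)
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = shuffled_partitions(1000, 0.7, 0.2)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 200 100
```

Since the shuffle happens exactly once, the three index lists are disjoint and together cover every sample. That is the property a per-epoch reshuffle before the split would break.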

Upvotes: 3

marsolmos

Reputation: 794

This approach should do it. It applies the train_test_split function from scikit-learn twice to split the dataset into train, validation and test sets:

from sklearn.model_selection import train_test_split

train_ratio = 0.70
validation_ratio = 0.20
test_ratio = 0.10

# dataX and dataY are the full feature and label arrays
# train is now 70% of the entire data set
# x_test and y_test temporarily hold the remaining 30%
x_train, x_test, y_train, y_test = train_test_split(dataX, dataY, test_size=1 - train_ratio)

# test is now 10% of the initial data set
# validation is now 20% of the initial data set
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio))

Upvotes: 0
