lolwatpie

Reputation: 121

Splitting a tensorflow dataset into training, test, and validation sets from keras.preprocessing API

I'm new to TensorFlow/Keras, and I have a file structure with 3000 folders, each containing 200 images, to load as data. I know that keras.preprocessing.image_dataset_from_directory lets me load the data and split it into training and validation sets, as below:

import tensorflow as tf

val_data = tf.keras.preprocessing.image_dataset_from_directory(
    'etlcdb/ETL9G_IMG/',
    image_size=(128, 127),
    validation_split=0.3,
    subset="validation",
    seed=1,
    color_mode='grayscale',
    shuffle=True)

Found 607200 files belonging to 3036 classes. Using 182160 files for validation.

But I'm not sure how to further split my validation set into validation and test splits while maintaining proper class balance. From what I can tell (from the GitHub source code), the take method simply takes the first x elements of the dataset, and skip skips them. I'm unsure whether this maintains stratification of the data, and I'm not quite sure how to return labels from the dataset to check it.
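For reference, this is roughly how I imagine checking the class distribution, assuming each batch is an (images, labels) tuple (I haven't verified this approach):

import numpy as np

# Note: this iterates the whole dataset once, so every image gets loaded.
labels = np.concatenate([y.numpy() for x, y in val_data])
classes, counts = np.unique(labels, return_counts=True)
print(dict(zip(classes, counts)))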

Any help would be appreciated.

Upvotes: 10

Views: 21129

Answers (3)

Sonjoy Das

Reputation: 151

You almost have the answer. The key is to use .take() and .skip() to further split the validation set into two datasets: one for validation and the other for test. Using your example, let's assume you need 70% for the training set, 10% for the validation set, and 20% for the test set. For completeness, I am also including the step that generates the training set. First, assign a few basic variables that must be the same when initially splitting the entire dataset into training and validation sets:

seed_train_validation = 1  # Must be the same for train_ds and val_ds
shuffle_value = True
validation_split = 0.3

train_ds = tf.keras.utils.image_dataset_from_directory(
    directory='etlcdb/ETL9G_IMG/',
    image_size=(128, 127),
    validation_split=validation_split,
    subset="training",
    seed=seed_train_validation,
    color_mode='grayscale',
    shuffle=shuffle_value)

val_ds = tf.keras.utils.image_dataset_from_directory(
    directory='etlcdb/ETL9G_IMG/',
    image_size=(128, 127),
    validation_split=validation_split,
    subset="validation",
    seed=seed_train_validation,
    color_mode='grayscale',
    shuffle=shuffle_value)

Next, determine how many batches of data are available in the validation set using tf.data.experimental.cardinality, and then move two-thirds of them (2/3 of 30% = 20%) to the test set as follows. Note that the default batch_size is 32 (per the documentation), so the cardinality is a count of batches, not of individual images.

val_batches = tf.data.experimental.cardinality(val_ds)
test_ds = val_ds.take((2 * val_batches) // 3)  # first two-thirds of the batches -> test set
val_ds = val_ds.skip((2 * val_batches) // 3)   # remaining one-third -> validation set

All three datasets (train_ds, val_ds, and test_ds) yield batches of images together with labels inferred from the directory structure, so you are good to go from here.
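As a quick sanity check, you can print the batch counts of the three datasets:

print('Train batches: %d' % tf.data.experimental.cardinality(train_ds))
print('Validation batches: %d' % tf.data.experimental.cardinality(val_ds))
print('Test batches: %d' % tf.data.experimental.cardinality(test_ds))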

Upvotes: 13

Michael D

Reputation: 1767

For splitting into training and validation sets, you can do something like the following.

The main point is to keep the same seed.

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    directory,
    label_mode='categorical',
    validation_split=0.2,
    subset="training",
    seed=1337,
    color_mode="grayscale",
    image_size=image_size,
    batch_size=batch_size,
)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    directory,
    validation_split=0.2,
    subset="validation",
    label_mode='categorical',
    seed=1337,
    color_mode="grayscale",
    image_size=image_size,
    batch_size=batch_size,
)

This is taken from: https://keras.io/examples/vision/image_classification_from_scratch/
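If you want to double-check that the two subsets are disjoint, recent TF versions attach a file_paths attribute to the datasets returned by this function (worth verifying for your TF version):

train_files = set(train_ds.file_paths)
val_files = set(val_ds.file_paths)
assert train_files.isdisjoint(val_files)  # no image appears in both splits
print(len(train_files), len(val_files))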

Upvotes: 2

Carlos Melus

Reputation: 1552

I could not find supporting documentation, but I believe image_dataset_from_directory takes the end portion of the dataset as the validation split. shuffle is now set to True by default, so the dataset is shuffled before training to avoid using only some classes for the validation split. The split done by image_dataset_from_directory only relates to the training process. If you need a (highly recommended) test split, you should split your data beforehand into training and testing; a sketch of one way to do that follows the list below. Then, image_dataset_from_directory will split your training data into training and validation.

I usually take a smaller percentage (10%) for the in-training validation, and split the original dataset into 80% training and 20% testing. With these values, the final splits (relative to the initial dataset size) are:

  • 80% training:
    • 72% training (used to adjust the weights in the network)
    • 8% in-training validation (used only to check the metrics of the model after each epoch)
  • 20% testing (never seen by the training process at all)
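Here is a minimal sketch of how that beforehand split could look on disk, assuming a source directory with one subfolder per class (all paths and names below are placeholders):

import random
import shutil
from pathlib import Path

def split_train_test(source_dir, target_dir, test_fraction=0.2, seed=1):
    """Copy each class folder into target_dir/train and target_dir/test,
    keeping the class subfolders so both splits stay stratified."""
    rng = random.Random(seed)
    for class_dir in Path(source_dir).iterdir():
        if not class_dir.is_dir():
            continue
        images = sorted(class_dir.glob('*'))
        rng.shuffle(images)
        n_test = int(len(images) * test_fraction)
        for split, files in (('test', images[:n_test]),
                             ('train', images[n_test:])):
            dest = Path(target_dir) / split / class_dir.name
            dest.mkdir(parents=True, exist_ok=True)
            for f in files:
                shutil.copy2(f, dest / f.name)

split_train_test('data/original', 'data/split', test_fraction=0.2)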

There is additional information on how to split the data in your directories in this question: Keras split train test set when using ImageDataGenerator

Upvotes: 3
