Reputation: 121
I'm new to tensorflow/keras, and I have a file structure with 3000 folders, each containing 200 images, to be loaded in as data. I know that keras.preprocessing.image_dataset_from_directory lets me load the data and split it into training/validation sets, as below:
val_data = tf.keras.preprocessing.image_dataset_from_directory('etlcdb/ETL9G_IMG/',
    image_size = (128, 127),
    validation_split = 0.3,
    subset = "validation",
    seed = 1,
    color_mode = 'grayscale',
    shuffle = True)
Found 607200 files belonging to 3036 classes. Using 182160 files for validation.
But then I'm not sure how to further split my validation set into a test split while maintaining proper class proportions. From what I can tell (from the GitHub source code), the take method simply takes the first x elements of the dataset, and skip skips over them. I am unsure whether this maintains stratification of the data, and I'm not quite sure how to return labels from the dataset to check.
Any help would be appreciated.
Upvotes: 10
Views: 21129
Reputation: 151
You almost got the answer. The key is to use .take() and .skip() to further split the validation set into two datasets -- one for validation and the other for test. Using your example, you need to execute the following lines of code. Let's assume that you need 70% for the training set, 10% for the validation set, and 20% for the test set. For the sake of completeness, I am also including the step to generate the training set. Let's also assign a few basic variables that must be the same when first splitting the entire dataset into training and validation sets.
seed_train_validation = 1 # Must be same for train_ds and val_ds
shuffle_value = True
validation_split = 0.3
train_ds = tf.keras.utils.image_dataset_from_directory(
    directory = 'etlcdb/ETL9G_IMG/',
    image_size = (128, 127),
    validation_split = validation_split,
    subset = "training",
    seed = seed_train_validation,
    color_mode = 'grayscale',
    shuffle = shuffle_value)

val_ds = tf.keras.utils.image_dataset_from_directory(
    directory = 'etlcdb/ETL9G_IMG/',
    image_size = (128, 127),
    validation_split = validation_split,
    subset = "validation",
    seed = seed_train_validation,
    color_mode = 'grayscale',
    shuffle = shuffle_value)
Next, determine how many batches of data are available in the validation set using tf.data.experimental.cardinality, and then move two-thirds of them (2/3 of 30% = 20%) to a test set as follows. Note that the default value of batch_size is 32 (see the documentation).
val_batches = tf.data.experimental.cardinality(val_ds)
test_ds = val_ds.take((2*val_batches) // 3)
val_ds = val_ds.skip((2*val_batches) // 3)
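The take/skip arithmetic above can be sketched on a toy tf.data pipeline; here a Dataset.range of 100 elements stands in for the real batched validation set (the sizes are illustrative, not from the original data):

```python
import tensorflow as tf

# Hypothetical stand-in: 100 "batches" instead of the real batched val_ds.
ds = tf.data.Dataset.range(100)
n = tf.data.experimental.cardinality(ds)  # scalar int64 tensor: 100

# First two-thirds become the test set, the rest stays as validation.
test_ds = ds.take((2 * n) // 3)
val_ds = ds.skip((2 * n) // 3)

print(int(tf.data.experimental.cardinality(test_ds).numpy()))  # 66
print(int(tf.data.experimental.cardinality(val_ds).numpy()))   # 34
```

Because take and skip use the same count, the two subsets are disjoint and together cover the whole input dataset.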
All three datasets (train_ds, val_ds, and test_ds) yield batches of images together with labels inferred from the directory structure, so you are good to go from here.
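If you want to inspect those labels (for example, to check the class balance of a split), you can iterate the batches and collect the label tensors. A minimal sketch, using a small synthetic in-memory dataset standing in for val_ds (the shapes and class count are illustrative):

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for val_ds: 128 tiny grayscale "images" with
# labels cycling over 5 classes, batched the same way as the real dataset.
images = np.zeros((128, 8, 8, 1), dtype="float32")
labels = np.arange(128) % 5
ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(32)

# Pull the labels out of every (image_batch, label_batch) pair.
all_labels = np.concatenate([y.numpy() for _, y in ds])
print(all_labels.shape)  # (128,)

# Count how many samples each class received.
unique, counts = np.unique(all_labels, return_counts=True)
print(dict(zip(unique.tolist(), counts.tolist())))
```

The same loop works on the real test_ds/val_ds, which lets you verify empirically whether the take/skip split kept the classes reasonably balanced.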
Upvotes: 13
Reputation: 1767
For splitting into train and validation sets, you can do something like the following. The main point is to keep the same seed.
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    directory,
    label_mode='categorical',
    validation_split=0.2,
    subset="training",
    seed=1337,
    color_mode="grayscale",
    image_size=image_size,
    batch_size=batch_size,
)

val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    directory,
    validation_split=0.2,
    subset="validation",
    label_mode='categorical',
    seed=1337,
    color_mode="grayscale",
    image_size=image_size,
    batch_size=batch_size,
)
This is taken from: https://keras.io/examples/vision/image_classification_from_scratch/
Upvotes: 2
Reputation: 1552
I could not find supporting documentation, but I believe image_dataset_from_directory takes the end portion of the dataset as the validation split. shuffle is now set to True by default, so the dataset is shuffled before splitting, to avoid the validation split containing only some of the classes.
The split done by image_dataset_from_directory only relates to the training process. If you need a (highly recommended) test split, you should split your data beforehand into training and testing. Then, image_dataset_from_directory will split your training data into training and validation.
I usually take a smaller percent (10%) for the in-training validation, and split the original dataset 80% training, 20% testing. With these values, the final splits (as fractions of the initial dataset size) are 72% training, 8% validation, and 20% testing.
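The two-stage percentages can be verified with a few lines of arithmetic (the fractions are the ones from the text; the variable names are illustrative):

```python
# Stage 1: hold out 20% of the original data as a test set.
test_fraction = 0.20

# Stage 2: image_dataset_from_directory carves 10% of the remaining
# training data off as in-training validation.
val_fraction_of_train = 0.10

train = (1 - test_fraction) * (1 - val_fraction_of_train)  # 0.8 * 0.9
val = (1 - test_fraction) * val_fraction_of_train          # 0.8 * 0.1

print(round(train, 2), round(val, 2), test_fraction)  # 0.72 0.08 0.2
```

Note that the 10% validation_split applies to the already-reduced training portion, which is why it yields 8% of the original data, not 10%.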
There is additional information how to split data in your directories in this question: Keras split train test set when using ImageDataGenerator
Upvotes: 3