Rezuana Haque

Reputation: 578

What is the correct way of splitting a dataset into train, validation and test?

I'm following an article which says the test folder should contain a single folder inside which all the test images are present (there are no subfolders/label folders). On the other hand, the train and validation folders should each contain 'n' folders, one per class, holding the images of the respective class. For example:

structure 1

/Data
     //train
          classA folder
          classB folder
          classC folder
     //val
          classA folder
          classB folder
          classC folder
     //test
          test folder

Again, I learned about the Python library split-folders, which splits the data into the following structure:

structure 2

/Data
     //train
          classA folder
          classB folder
          classC folder
     //val
          classA folder
          classB folder
          classC folder
     //test
          classA folder
          classB folder
          classC folder

I implemented structure 2 by using the Python library split-folders and evaluated the model with the following call:

model.evaluate(test_generator,batch_size=32)

Here I only provided test_generator (which I got from flow_from_directory) to the evaluate function, without passing any labels, and I got an accuracy of around 88%. My questions are:

  1. Which structure should I follow for the data splitting?
  2. How can I evaluate or predict my model if I use structure 1? How can I extract labels from the data?
  3. Given that I did not supply any labels, how is the Python library split-folders evaluating the model without throwing an error?

Upvotes: 1

Views: 568

Answers (1)

Djinn

Reputation: 856

  1. It looks like structure 2 just splits whatever is available, which is fundamentally correct. In practice, you'll most likely use structure 1 with flow_from_directory(). You can't call evaluate() without labels, so your test_generator is more akin to a validation set; you can technically evaluate using that "test" data, since it is created the same way, but ideally the two would be used differently.
  2. flow_from_directory() outputs a Dataset-like iterator that carries both the features and the labels: the per-image label indices are stored in its classes attribute, and the class-name-to-index mapping in class_indices. When you pass such an object to evaluate(), the method uses both the features and the labels from it. If your generator is x, the mapping is x.class_indices, which is a dictionary. When you pass the object to predict(), only the features are used; the labels are ignored. Unless you need to manually retrieve something from it, you don't need to access anything inside the object when evaluating or predicting.
  3. split-folders does not do anything with your model; it only splits files into folders. The labels that evaluate() uses come from test_generator itself, which inferred them from the class subfolders.
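To make point 2 concrete: Keras derives class_indices from the subfolder names, sorted alphanumerically. A stdlib approximation of that mapping (a hypothetical helper for illustration, not the Keras source) looks like this:

```python
import os

def infer_class_indices(train_dir):
    """Mimic how flow_from_directory builds class_indices: each
    subdirectory is a class, and indices follow sorted name order."""
    classes = sorted(
        d for d in os.listdir(train_dir)
        if os.path.isdir(os.path.join(train_dir, d))
    )
    return {name: i for i, name in enumerate(classes)}
```

So for the structure above, classA maps to 0, classB to 1, and classC to 2, regardless of the order the folders were created in.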

The subfolders named after the classes in train and val exist so each image can be compared to its label (the folder name, i.e. the class). This is how flow_from_directory() keeps track of each image's class. Since prediction is supposed to take unseen data as input, there is no label to compare against, hence no labels (and no class subfolders) when you split out your test folder.

Another common option is to split only into train and test sets, then carve your validation set out of the training set. Both methods are fundamentally the same.
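A minimal sketch of that second option, carving a validation subset out of a list of training files (a hypothetical helper using only the standard library):

```python
import random

def carve_validation(train_files, val_fraction=0.2, seed=0):
    """Randomly split a list of training file paths into
    (train, val) subsets with no overlap."""
    files = list(train_files)
    random.Random(seed).shuffle(files)
    n_val = int(len(files) * val_fraction)
    return files[n_val:], files[:n_val]
```

If you stay within Keras, ImageDataGenerator(validation_split=0.2) combined with flow_from_directory(..., subset="training") and subset="validation" achieves the same split without moving any files.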

Upvotes: 2
