Shinning Eyes

Reputation: 51

Correctly splitting the dataset

I have downloaded a dataset with 10 object classes for object detection. The dataset is not divided into training, validation, and testing sets. However, the author mentions in his paper that the dataset is divided into 20% training, 20% validation, and 60% testing, with the images chosen randomly.

Following the author's criteria, I randomly selected 20% of the images for training, 20% for validation, and 60% for testing.

I want to know a couple of things:
1) Do difficult images belong in the training set, the validation set, or the test set? For example, currently there are 41 difficult images in the test set, 30 in the training set, and 20 in the validation set.
2) How can I ensure that all ten object classes are equally distributed?
Update:
3) Ideally, for a balanced split, should the difficult images be distributed equally? And how much does it affect the results if the test set has more difficult images than the training or validation sets, or vice versa?

Ten classes: Airplane, Storage tank, Baseball ground, Tennis court, Basketball court, Ground track field, Bridge, Ship, Harbor, and Vehicle.

I have 650 images in total. Of these, 466 images contain exactly one class (an image can contain more than one object of that class): Airplane = 88 images, Storage tank = 10, Baseball ground = 46, Tennis court = 29, Basketball court = 32, Ground track field = 55, Bridge = 58, Ship = 36, Harbor = 27, and Vehicle = 85.

The remaining 184 images contain multiple classes.

In total there are 757 airplanes, 302 ships, 655 storage tanks, 390 baseball diamonds, 524 tennis courts, 159 basketball courts, 163 ground track fields, 224 harbors, 124 bridges, and 477 vehicles.

Upvotes: -1

Views: 119

Answers (1)

Rob

Reputation: 1131

The most common technique is random selection. For example, if you have 1000 images, you can create an array that contains the name of every file and shuffle its elements using a random permutation. Then you use the first 200 elements for training, the next 200 for validation, and the remaining elements for testing (in the case of a 20%/20%/60% split).
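A minimal sketch of that split in Python (the filename list here is a hypothetical placeholder; in practice you would collect the names from your dataset directory):

```python
import random

# Hypothetical filenames; in practice, read them from the dataset
# directory (e.g. with os.listdir).
filenames = [f"image_{i:04d}.jpg" for i in range(650)]

random.seed(42)            # fix the seed so the split is reproducible
random.shuffle(filenames)  # random permutation of the file list

n = len(filenames)
n_train = int(0.2 * n)     # 20% for training
n_val = int(0.2 * n)       # 20% for validation

train = filenames[:n_train]
val = filenames[n_train:n_train + n_val]
test = filenames[n_train + n_val:]  # remaining 60% for testing
```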

If a class is extremely unbalanced, you can force the same proportion of classes in every set. To do that, apply the procedure I mentioned class by class (a stratified split).
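A sketch of that class-by-class (stratified) split, assuming for simplicity that each image carries a single class label; your 184 multi-class images would need an extra rule, e.g. stratifying by the rarest class present in the image:

```python
import random
from collections import defaultdict

# Hypothetical mapping from filename to its (single) class label.
labels = {"image_0001.jpg": "Airplane", "image_0002.jpg": "Ship"}

by_class = defaultdict(list)
for name, cls in labels.items():
    by_class[cls].append(name)

random.seed(42)
train, val, test = [], [], []
for cls, names in by_class.items():
    random.shuffle(names)            # permute within each class
    n_train = int(0.2 * len(names))  # 20% of this class for training
    n_val = int(0.2 * len(names))    # 20% for validation
    train += names[:n_train]
    val += names[n_train:n_train + n_val]
    test += names[n_train + n_val:]  # remaining 60% for testing
```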

You shouldn't choose images by hand. Even if you know there are some difficult images in your dataset, you cannot hand-pick them into the training, validation, and test sets.

If a few images can strongly change the accuracy and you want a fair evaluation of your algorithm, you can repeat the random split several times. In some runs there will be many difficult images in the training set, and in others in the validation or test set. You can then report the mean and standard deviation of your accuracy (or whatever metric you are using).
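For example (the `evaluate` function below is a hypothetical placeholder for your actual training and evaluation pipeline):

```python
import random
import statistics

def evaluate(train, val, test):
    """Placeholder: train your detector and return a test metric (e.g. mAP)."""
    return random.random()  # stand-in for a real training/evaluation run

filenames = [f"image_{i:04d}.jpg" for i in range(650)]
scores = []
for seed in range(10):       # e.g. 10 repeated random splits
    random.seed(seed)
    names = filenames[:]
    random.shuffle(names)
    n = len(names)
    train = names[:n // 5]            # 20% training
    val = names[n // 5:2 * n // 5]    # 20% validation
    test = names[2 * n // 5:]         # 60% testing
    scores.append(evaluate(train, val, test))

print(f"mean = {statistics.mean(scores):.3f}, "
      f"std = {statistics.stdev(scores):.3f}")
```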

UPDATED:

I see, from your description an image can contain more than one object, can't it? For example, can you have two ships and one bridge in the same image? I usually work with datasets that contain a single object per image; to detect several objects in an image, I scan different parts of the image looking for single objects.
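A minimal sketch of that sliding-window scanning idea (the window size and stride here are arbitrary illustrative values, not values from the question):

```python
def sliding_windows(image_w, image_h, win=128, stride=64):
    """Yield (x, y, w, h) boxes covering the image; a single-object
    classifier is then run on each crop to detect multiple objects."""
    for y in range(0, image_h - win + 1, stride):
        for x in range(0, image_w - win + 1, stride):
            yield (x, y, win, win)

# Example: enumerate candidate windows over a 512x512 image.
boxes = list(sliding_windows(512, 512))
```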

The author of the paper you mention probably divided the dataset randomly. If you use a more complex division in a research paper, you should mention it.

As for your question about the effect of having more difficult images in one set or another, the answer is complex. It depends on the algorithm and on how similar the training images are to the validation and test images.

With a complex model (for example, a neural net with many layers and neurons) you can obtain whatever accuracy you want on the training set (for example, 100%). If the training images are very similar to the validation and test images, the accuracy there will be similar. But if they are not very similar, you have overfitted, and the accuracy will be lower on the validation and test sets. To solve that you need a simpler model (for example, by reducing the number of neurons or using a good regularization technique); the training accuracy will then be lower, but the validation and test accuracy will be closer to it.

Upvotes: 1
