Reputation: 31
I'm currently training an image classifier using Nvidia DIGITS. I'm downloading 1,000,000 images as part of the ILSVRC12 dataset. As you may know, this dataset consists of 1,000 classes, with 1,000 images per class. The problem is that a lot of the images are downloaded from dead Flickr URLs, thus populating a decent portion of my dataset (about 5-10%) with the generic "unavailable" image shown below. I plan on going through and deleting each copy of this "generic" image, thus leaving my dataset with only images relevant to each class.
This action would make the size of the classes uneven: they would no longer contain 1,000 images each, but somewhere between 900 and 1,000. Does the size of each class have to be equal? In other words, can I delete these generic images without affecting the accuracy of my classifier? Thanks in advance for your feedback.
Upvotes: 0
Views: 1053
Reputation: 114866
The number of training examples per class does not have to be exactly equal. A 10% difference one way or the other won't affect the training process significantly.
If you are still concerned about the label imbalance, you may consider using an "InfogainLoss" layer to compensate for the missing examples.
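As a hedged sketch, in Caffe the usual "SoftmaxWithLoss" layer could be replaced with something along these lines (the bottom blob names and the infogain matrix file are placeholders; you would generate `infogain_H.binaryproto` yourself, e.g. a diagonal matrix with larger weights for the under-represented classes):

```protobuf
layer {
  name: "loss"
  type: "InfogainLoss"
  bottom: "fc8"        # your network's class scores
  bottom: "label"
  top: "loss"
  infogain_loss_param {
    source: "infogain_H.binaryproto"  # 1000x1000 weighting matrix H
  }
}
```

With H set to the identity matrix this reduces to the ordinary softmax log-loss, so you can sanity-check the setup before adding class weights.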
P.S.
You may take advantage of the fact that all invalid Flickr photos are in fact identical and remove them automatically based on their md5sum.
See this answer for an example of how to filter out these images when downloading ImageNet photos.
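A minimal sketch of the md5sum approach in Python (the directory layout and file names below are placeholders; point `remove_duplicates` at your real dataset root and at one saved copy of the "unavailable" placeholder image):

```python
import hashlib
import tempfile
from pathlib import Path

def md5sum(path: Path) -> str:
    """Return the MD5 hex digest of a file's contents."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def remove_duplicates(dataset_dir: Path, reference: Path) -> int:
    """Delete every .jpg under dataset_dir whose MD5 matches the reference image."""
    bad = md5sum(reference)
    removed = 0
    for image in dataset_dir.rglob("*.jpg"):
        if md5sum(image) == bad:
            image.unlink()
            removed += 1
    return removed

# Demo on throwaway files standing in for the real dataset:
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "unavailable.jpg").write_bytes(b"placeholder")
    (root / "n01440764").mkdir()
    (root / "n01440764" / "img1.jpg").write_bytes(b"real photo")
    (root / "n01440764" / "img2.jpg").write_bytes(b"placeholder")  # dead-link copy
    n = remove_duplicates(root / "n01440764", root / "unavailable.jpg")
    print(n)  # 1 placeholder removed
```

Since every dead-link download is byte-for-byte identical, one reference hash is enough; there is no need for image comparison.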
Upvotes: 1