Reputation: 2093
I'm trying to do a binary classification problem with Keras, using the ImageDataGenerator.flow_from_directory method to generate batches. However, my classes are very imbalanced, roughly 8x or 9x more samples in one class than the other, causing the model to get stuck predicting the same output class for every example. Is there a way to set flow_from_directory to either oversample from my small class or undersample from my large class during each epoch? For now, I've just created multiple copies of each image in my smaller class, but I'd like to have a bit more flexibility.
Upvotes: 18
Views: 11458
Reputation: 281
One thing you can do is set the class_weight parameter when calling model.fit() or model.fit_generator().
You can also easily compute the class_weights using the sklearn and numpy libraries as follows:
from sklearn.utils import class_weight
import numpy as np

# Compute one weight per class, inversely proportional to class frequency.
# Newer versions of sklearn require classes and y to be passed as keyword arguments.
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_generator.classes),
    y=train_generator.classes)
Afterwards, it's as simple as passing your class_weights to the class_weight parameter:
model.fit_generator(..., class_weight=class_weights)
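Note that compute_class_weight returns a NumPy array, while newer versions of Keras expect class_weight to be a dictionary mapping class indices to weights. A minimal conversion, assuming the class indices run 0..N-1 as produced by flow_from_directory:
# Convert the weight array into the {class_index: weight} dict form that Keras expects.
class_weights = dict(enumerate(class_weights))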
Upvotes: 11
Reputation: 147
You can also calculate the number of files in each class and normalize the class_weights yourself:
import os

# Count how many files each class folder contains.
files_per_class = []
for folder in os.listdir(input_foldr):
    if os.path.isdir(os.path.join(input_foldr, folder)):
        files_per_class.append(len(os.listdir(os.path.join(input_foldr, folder))))
total_files = sum(files_per_class)

# Weight each class by the fraction of files belonging to the other classes,
# so under-represented classes get larger weights.
class_weights = {}
for i in range(len(files_per_class)):
    class_weights[i] = 1 - (float(files_per_class[i]) / total_files)
print(class_weights)
...
...
...
model.fit_generator(..., class_weight=class_weights)
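If you are already using flow_from_directory, the same per-class counts can be read off the generator's classes attribute instead of walking the directory tree; a small sketch, assuming train_generator was created with flow_from_directory:
import collections

# train_generator.classes holds one class index per file found on disk.
counter = collections.Counter(train_generator.classes)
total = float(sum(counter.values()))
class_weights = {cls: 1 - (count / total) for cls, count in counter.items()}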
Upvotes: 1
Reputation: 40526
With the current version of Keras it's not possible to balance your dataset using only Keras built-in methods. flow_from_directory simply builds a list of all files and their classes, shuffles it (if needed), and then iterates over it.
But you could use a different trick: write your own generator which does the balancing in Python:
def balanced_flow_from_directory(flow_from_directory, options):
    # Wrap the original generator and rebalance every batch it yields.
    for x, y in flow_from_directory:
        yield custom_balance(x, y, options)
Here custom_balance should be a function that, given a batch (x, y), balances it and returns a balanced batch (x', y'). For most applications the batch size doesn't need to stay the same, but there are some special cases (e.g. stateful RNNs) where batches must have a fixed size.
Upvotes: 13