George

Reputation: 2093

keras flow_from_directory over or undersample a class

I'm working on a binary classification problem in Keras, using the ImageDataGenerator.flow_from_directory method to generate batches. My classes are very imbalanced, with roughly 8 to 9 times as many images in one class as in the other, so the model gets stuck predicting the same output class for every example. Is there a way to make flow_from_directory either oversample the small class or undersample the large class during each epoch? For now, I've just created multiple copies of each image in the smaller class, but I'd like a bit more flexibility.

Upvotes: 18

Views: 11458

Answers (3)

Pasha Dembo

Reputation: 281

One thing you can do is set the class_weight parameter when calling model.fit() or model.fit_generator().

You can compute the class weights easily with scikit-learn and NumPy as follows:

from sklearn.utils import class_weight
import numpy as np

# train_generator.classes holds the integer class index of every image found.
classes = np.unique(train_generator.classes)
weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=classes,
    y=train_generator.classes)

# Keras expects a dict mapping class index to weight, not an array.
class_weights = {int(c): float(w) for c, w in zip(classes, weights)}

Afterwards, pass the result as the class_weight argument:

model.fit_generator(..., class_weight=class_weights)
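To make the 'balanced' heuristic concrete, here is a minimal sketch that recomputes it by hand as n_samples / (n_classes * count_per_class). The helper name balanced_class_weights and the made-up label array (800 images of class 0, 100 of class 1, standing in for train_generator.classes) are assumptions for illustration only.

```python
import numpy as np

def balanced_class_weights(labels):
    """Return {class_index: weight} with weight = n_samples / (n_classes * count)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = labels.size / (len(classes) * counts)
    # Plain Python ints and floats keep the dict easy to pass around and print.
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Hypothetical labels standing in for train_generator.classes.
labels = np.array([0] * 800 + [1] * 100)
class_weights = balanced_class_weights(labels)
print(class_weights)  # the rare class 1 gets 8x the weight of class 0
```

With an 8:1 imbalance the minority class ends up weighted 8 times more heavily in the loss, which is roughly what duplicating its images 8 times would achieve, without touching the files on disk.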

Upvotes: 11

Michael

Reputation: 147

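You can also count the files in each class and derive normalized class weights from the counts:

import os

# Sort the subfolders so that index i matches the class index Keras assigns
# (flow_from_directory orders class folders alphanumerically).
class_dirs = sorted(
    d for d in os.listdir(input_folder)
    if os.path.isdir(os.path.join(input_folder, d)))
files_per_class = [len(os.listdir(os.path.join(input_folder, d)))
                   for d in class_dirs]
total_files = sum(files_per_class)

# Weight each class by the share of files that belong to the other classes.
class_weights = {}
for i, n_files in enumerate(files_per_class):
    class_weights[i] = 1 - (n_files / total_files)
print(class_weights)
...
...
...
model.fit_generator(..., class_weight=class_weights)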

Upvotes: 1

Marcin Możejko

Reputation: 40526

With the current version of Keras, it's not possible to balance your dataset using only Keras built-in methods. flow_from_directory simply builds a list of all files and their classes, shuffles it (if requested), and then iterates over it.

But you could use a different trick: write your own generator that does the balancing in Python:

def balanced_flow_from_directory(flow_from_directory, options):
    # Wrap the Keras iterator and rebalance every batch it yields.
    for x, y in flow_from_directory:
        yield custom_balance(x, y, options)

Here custom_balance should be a function that, given a batch (x, y), balances it and returns a balanced batch (x', y'). For most applications the batch size doesn't need to stay the same, but there are some unusual use cases (e.g. stateful RNNs) where batches must have a fixed size.
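One possible custom_balance, as a minimal sketch rather than the definitive approach: it undersamples each batch so that every class present keeps as many samples as the rarest one. It assumes y is a 1-D array of class labels (as with class_mode='binary'), and treating options as a dict with an optional "rng" entry is my own convention, not something Keras defines.

```python
import numpy as np

def custom_balance(x, y, options=None):
    """Undersample a batch so each class present keeps the same number of samples."""
    rng = (options or {}).get("rng")
    if rng is None:
        rng = np.random.default_rng()
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()  # size of the rarest class in this batch
    # Pick n_keep random indices per class, without replacement.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)  # avoid grouping the batch by class
    return x[keep], y[keep]

# Toy batch: 8 samples of class 0 and 2 of class 1.
x = np.arange(20).reshape(10, 2)
y = np.array([0.0] * 8 + [1.0] * 2)
xb, yb = custom_balance(x, y, {"rng": np.random.default_rng(0)})
print(yb.shape)  # 2 samples per class remain, so 4 in total
```

Note that a batch containing only one class is returned in full, and that undersampling shrinks batches considerably at an 8:1 imbalance; sampling the minority indices with replacement instead would keep batches larger at the cost of repeated images.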

Upvotes: 13
