Reputation: 41
I'm working on a binary image classification problem in which the two classes are heavily imbalanced (~590 instances of class 1 versus ~5900 of class 2) but still quite distinct.
Is there any way to fix this? I would like to try SMOTE or weighted random oversampling.
I've tried a lot of different things but I'm stuck. I've tried class_weights=[10,1], [5900,590], and [1/5900,1/590], and my model still predicts only class 2.
I've tried tf.data.experimental.sample_from_datasets but couldn't get it to work. I've even tried a sigmoid focal cross-entropy loss, which helped a lot but not enough.
I want to oversample class 1 by a factor of 10. The only thing that has sort of worked is manual oversampling, i.e. copying the class 1 images in the train directory until they match the number of instances in class 2.
Is there an easier way to do this? I'm using Google Colab, so copying files like this is extremely inefficient.
Is there a way to specify SMOTE parameters or oversampling within the data generator or similar?
data/
...class_1/
........image_1.jpg
........image_2.jpg
...class_2/
........image_1.jpg
........image_2.jpg
My data is in the form shown above.
TRAIN_DATAGEN = ImageDataGenerator(rescale = 1./255.,
                                   rotation_range = 40,
                                   width_shift_range = 0.2,
                                   height_shift_range = 0.2,
                                   shear_range = 0.2,
                                   zoom_range = 0.2,
                                   horizontal_flip = True)
TEST_DATAGEN = ImageDataGenerator(rescale = 1.0/255.)
TRAIN_GENERATOR = TRAIN_DATAGEN.flow_from_directory(directory = TRAIN_DIR,
                                                    batch_size = BACTH_SIZE,
                                                    class_mode = 'binary',
                                                    target_size = (IMG_HEIGHT, IMG_WIDTH),
                                                    subset = 'training',
                                                    seed = DATA_GENERATOR_SEED)
VALIDATION_GENERATOR = TEST_DATAGEN.flow_from_directory(directory = VALIDATION_DIR,
                                                        batch_size = BACTH_SIZE,
                                                        class_mode = 'binary',
                                                        target_size = (IMG_HEIGHT, IMG_WIDTH),
                                                        subset = 'validation',
                                                        seed = DATA_GENERATOR_SEED)
...
...
...
HISTORY = MODEL.fit(TRAIN_GENERATOR,
                    validation_data = VALIDATION_GENERATOR,
                    epochs = EPOCHS,
                    verbose = 2,
                    callbacks = [EARLY_STOPPING],
                    class_weight = CLASS_WEIGHT)
I'm relatively new to TensorFlow but have some experience with ML as a whole. I've been tempted to switch to PyTorch several times, since its data loaders accept parameters that automatically over- or undersample, e.g. sampler=WeightedRandomSampler.
Note: I've looked at many tutorials on oversampling, but none of them deal with image classification. I want to stick with TF/Keras because it makes transfer learning easy. Could you help out?
Upvotes: 4
Views: 4000
Reputation: 123
You can use this strategy to calculate weights based on the imbalance:
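from sklearn.utils import class_weight
import numpy as np
# 'balanced' weights each class inversely to its frequency.
# Recent scikit-learn versions require these arguments as keywords.
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_generator.classes),
    y=train_generator.classes)
# Keras expects a dict mapping class index -> weight.
train_class_weights = dict(enumerate(class_weights))
# fit_generator is deprecated in TF 2.x; fit accepts generators directly.
model.fit(..., class_weight=train_class_weights)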
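To see what the 'balanced' heuristic would give for the counts in the question, here is a minimal sketch that reimplements the documented formula, n_samples / (n_classes * count), in plain Python. The helper name balanced_class_weights and the label layout (class_1 mapped to index 0, class_2 to index 1, which is the alphabetical order flow_from_directory uses) are assumptions for illustration, not part of the original code.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * count)} for every class.

    Hypothetical helper mirroring scikit-learn's 'balanced' heuristic.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }


# Assumed layout: class_1 -> 0 (~590 images), class_2 -> 1 (~5900 images).
labels = [0] * 590 + [1] * 5900
weights = balanced_class_weights(labels)
print(weights)
```

For these counts the minority class gets a weight of about 5.5 and the majority class about 0.55, i.e. the same 10:1 ratio as {0: 10, 1: 1}, just scaled so the average weight per sample stays near 1. If a 10:1 dictionary was already tried without success, this alone may not change the outcome.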
Upvotes: 3
Reputation: 1308
Since you already define your class_weight as a dictionary, e.g., {0: 10, 1: 1}, you might try augmenting the minority class. See "balancing an imbalanced dataset with keras image generator" and the tutorial mentioned there: https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
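If copying files on disk is the bottleneck, one alternative might be to oversample the list of file paths instead, so that each minority image simply appears several times in the list and gets a different random augmentation each time it is loaded. The sketch below uses only the standard library; the function name oversample_to_balance and the file names are hypothetical.

```python
import random
from collections import Counter


def oversample_to_balance(samples, seed=0):
    """Duplicate minority-class entries at random until all classes match.

    samples: list of (path, label) pairs. Returns a new shuffled list in
    which every class has as many entries as the largest class.
    """
    rng = random.Random(seed)
    by_label = {}
    for path, label in samples:
        by_label.setdefault(label, []).append(path)

    target = max(len(paths) for paths in by_label.values())
    balanced = []
    for label, paths in by_label.items():
        # Keep every original entry once.
        balanced.extend((p, label) for p in paths)
        # Top up with random duplicates (sampling with replacement).
        extra = target - len(paths)
        balanced.extend((rng.choice(paths), label) for _ in range(extra))

    rng.shuffle(balanced)
    return balanced


# Hypothetical file lists mirroring the question's ~590 / ~5900 split.
samples = [(f"data/class_1/image_{i}.jpg", "class_1") for i in range(590)]
samples += [(f"data/class_2/image_{i}.jpg", "class_2") for i in range(5900)]

balanced = oversample_to_balance(samples)
counts = Counter(label for _, label in balanced)
print(counts)
```

The balanced list could then be put into a pandas DataFrame and passed to ImageDataGenerator.flow_from_dataframe, which reads paths and labels from columns rather than from the folder layout. Only the training set should be oversampled; the validation set should keep its natural class distribution.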
Upvotes: 0
Reputation: 2730
In Python, you can implement SMOTE using the imblearn library as follows:
from imblearn.over_sampling import SMOTE
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
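Note that fit_resample expects a 2-D array of shape (n_samples, n_features), so image batches would need to be flattened first and reshaped afterwards. A rough sketch, assuming the images are already loaded into a NumPy array of shape (n, height, width, channels); the helper names here are hypothetical, and imblearn is imported inside the function so the reshape helpers can be used on their own.

```python
import numpy as np


def flatten_images(images):
    """Reshape (n, h, w, c) images to (n, h*w*c) and return the image shape."""
    n = images.shape[0]
    return images.reshape(n, -1), images.shape[1:]


def unflatten_images(flat, image_shape):
    """Inverse of flatten_images: restore (n, h, w, c) from (n, h*w*c)."""
    return flat.reshape((-1,) + tuple(image_shape))


def smote_images(images, labels, random_state=0):
    """Apply SMOTE to an in-memory image array (assumption: it fits in RAM)."""
    from imblearn.over_sampling import SMOTE  # requires imbalanced-learn

    flat, image_shape = flatten_images(images)
    flat_res, labels_res = SMOTE(random_state=random_state).fit_resample(flat, labels)
    return unflatten_images(flat_res, image_shape), labels_res
```

Because SMOTE interpolates between neighbouring samples, applying it to raw pixels tends to produce blended images, and it requires the whole training set in memory, so it may be a poor fit for a directory-based generator. Random oversampling combined with augmentation is often the more practical choice for images.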
Upvotes: 0