willz

Reputation: 115

Number of training samples for each class using ImageDataGenerator with validation_split

Using Keras, I have images in X and labels in Y. Then I do:

 train_datagen = ImageDataGenerator(validation_split=0.25)
 train_generator = train_datagen.flow(X, Y, subset='training')

My question is: when train_generator is used within fit_generator of a model, how many samples from each class are actually presented as training samples?

For example, suppose I have 1000 (x, y) pairs across 3 classes: 500 for class A, 300 for class B and 200 for class C. How many samples from classes A, B and C does fit_generator actually see as training samples? Is it simply 500 * (1.0 - 0.25), and so on for the other classes?

Upvotes: 2

Views: 6037

Answers (1)

today

Reputation: 33410

If we inspect the relevant part of the source code, we can see that the first validation_split * num_samples samples of X (and y) are used for validation, and the rest are used for training:

split_idx = int(len(x) * image_data_generator._validation_split)

# ...
if subset == 'validation':
    x = x[:split_idx]
    x_misc = [np.asarray(xx[:split_idx]) for xx in x_misc]
    if y is not None:
        y = y[:split_idx]
else:
    x = x[split_idx:]
    x_misc = [np.asarray(xx[split_idx:]) for xx in x_misc]
    if y is not None:
        y = y[split_idx:]
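To see what this means for per-class counts, here is a small NumPy sketch that applies the same slicing to the example from the question, with the 1000 labels shuffled so that the split is valid but not stratified (the exact per-class counts depend on the shuffle):

```python
import numpy as np

# 500 class-A, 300 class-B and 200 class-C labels, shuffled.
rng = np.random.RandomState(42)
y = np.array([0] * 500 + [1] * 300 + [2] * 200)
rng.shuffle(y)

validation_split = 0.25
split_idx = int(len(y) * validation_split)  # 250

# Mimic Keras: the first split_idx samples form the validation subset.
val_counts = np.bincount(y[:split_idx], minlength=3)
train_counts = np.bincount(y[split_idx:], minlength=3)
print(val_counts, train_counts)
```

The two count vectors add up to [500, 300, 200], but nothing forces each of them to be 25% and 75% of every class.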

So it is your responsibility to make sure the proportion of classes is the same in both the training and validation subsets; Keras does not guarantee that when using this functionality. The only thing Keras verifies is that the same set of classes appears in both subsets:

if not np.array_equal(
        np.unique(y[:split_idx]),
        np.unique(y[split_idx:])):
    raise ValueError('Training and validation subsets '
                     'have different number of classes after '
                     'the split. If your numpy arrays are '
                     'sorted by the label, you might want '
                     'to shuffle them.')
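For instance, with the sorted labels from the question's example (all class-A samples first), this check would fire; a standalone sketch of the comparison:

```python
import numpy as np

# Sorted labels: 500 A, 300 B, 200 C -- the worst case for this split.
y = np.array([0] * 500 + [1] * 300 + [2] * 200)
split_idx = int(len(y) * 0.25)  # 250

# The validation slice y[:250] contains only class 0, so the class sets
# differ and Keras would raise the ValueError above.
same_classes = np.array_equal(np.unique(y[:split_idx]),
                              np.unique(y[split_idx:]))
print(same_classes)  # False
```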

So the solution for a stratified split (i.e. one preserving the proportion of samples of each class in the training and validation subsets) is to use sklearn.model_selection.train_test_split with the stratify argument set:

from sklearn.model_selection import train_test_split

val_split = 0.25
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=val_split, stratify=y)

# The validation subset is sliced off the beginning of the arrays
# (see the source above), so the validation samples must come first.
X = np.concatenate((X_val, X_train))
y = np.concatenate((y_val, y_train))

Now you can pass validation_split=val_split to ImageDataGenerator and it is guaranteed that the proportion of classes is the same in both training and validation subsets.
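A quick sanity check of this approach on the question's example, assuming (as the source above shows) that the validation subset is sliced off the front of the arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 500 class-A, 300 class-B and 200 class-C samples.
y = np.array([0] * 500 + [1] * 300 + [2] * 200)
X = np.arange(len(y)).reshape(-1, 1)

val_split = 0.25
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=val_split, stratify=y, random_state=0)

# Validation samples first, since Keras slices them off the front.
y_ordered = np.concatenate((y_val, y_train))
split_idx = int(len(y_ordered) * val_split)

print(np.bincount(y_ordered[:split_idx]))  # [125  75  50] -- 25% of each class
print(np.bincount(y_ordered[split_idx:]))  # [375 225 150] -- 75% of each class
```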

Upvotes: 3
