user3638629

Reputation: 145

Low validation accuracy with Keras `fit_generator` but not with `fit`

I have a data set for a binary classification problem in which both classes are equally represented. Since the data set does not fit into memory (4 million data points), I store it as an HDF5 file that is read and fed incrementally into a simple Keras model via fit_generator. The problem is that I'm getting low validation accuracy with fit_generator, whereas everything is fine if I simply use fit. As mentioned, the full data set does not fit into memory, but for debugging purposes, and for the rest of this post, I use only 100k of the 4M data points.
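
For completeness, here is roughly how such a file could be created with h5py. The file name, the dataset name ('features'), and the label-in-last-column layout are assumptions of mine that match the generator shown further down; the random values are only placeholders:

import h5py
import numpy as np

# Assumed layout: one 'features' dataset whose last column holds the label,
# matching the slicing (dset[:, :-1] / dset[:, -1]) used by the generator below.
n_samples, n_features = 100000, 2581
with h5py.File('data.h5', 'w') as h5:
    dset = h5.create_dataset('features', shape=(n_samples, n_features + 1), dtype='float32')
    # Write incrementally in chunks so the full array never has to be in memory.
    chunk = 10000
    for start in range(0, n_samples, chunk):
        end = min(start + chunk, n_samples)
        dset[start:end, :-1] = np.random.rand(end - start, n_features)     # placeholder features
        dset[start:end, -1] = np.random.randint(0, 2, end - start)         # placeholder labels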

Since the aim is to do stratified 10-fold CV for the full data set, I manually partition the data set indexes into indexes for training, validation, and evaluation sets. I call fit_generator with a generator function yielding batches of training (or validation) samples and labels covering the specified indexes from the first quarter of the HDF5 file, then from the second quarter, etc.
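
For reference, the split could be produced along these lines. This is only an illustrative sketch using scikit-learn (StratifiedKFold plus train_test_split), not my actual manual partitioning code, and labels is a placeholder for the label column read from the HDF5 file:

import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

labels = np.random.randint(0, 2, 100000)   # placeholder for the labels read from the HDF5 file
all_idx = np.arange(len(labels))

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for trainval_idx, eval_idx in skf.split(all_idx, labels):
    # Carve a stratified validation set out of the training portion of the fold.
    train_idx, valid_idx = train_test_split(
        trainval_idx, test_size=0.2, stratify=labels[trainval_idx], random_state=0)
    # train_idx / valid_idx / eval_idx can now be passed to fit or to the generator.
    break  # only the first fold is used for this debugging run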

I know the validation part of fit_generator uses test_on_batch under the hood, as does evaluate_generator. I also tried a manual approach with train_on_batch and test_on_batch (sketched just below), but with the same result: validation accuracy is low with fit_generator and the like, but high with fit if the data set is loaded into memory all at once. The model is the same in both cases (fit vs fit_generator).
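
The manual loop looked roughly like this; it is only a sketch, and it reuses the model, the generate_data generator, hdf5_file, train_idx, valid_idx, and batch_size defined elsewhere in this post:

import numpy as np

# Sketch of the manual train_on_batch / test_on_batch loop I tried.
train_gen = generate_data(hdf5_file, train_idx, batch_size)
valid_gen = generate_data(hdf5_file, valid_idx, batch_size)

for epoch in range(5):
    for _ in range(len(train_idx) // batch_size):
        x_batch, y_batch = next(train_gen)
        model.train_on_batch(x_batch, y_batch)

    val_metrics = []
    for _ in range(len(valid_idx) // batch_size):
        x_batch, y_batch = next(valid_gen)
        val_metrics.append(model.test_on_batch(x_batch, y_batch))
    print('epoch %d - val loss/acc: %s' % (epoch + 1, np.mean(val_metrics, axis=0)))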

Data set and model

My debugging data set has ~100k samples and labels (~50k in class 0 and ~50k in class 1). Training and validation are performed on 75% of the data (roughly 60k samples for training and 15k for validation). The two classes are equally distributed among the training and validation samples.

Here is the very simple model I use:

from keras.layers import Input, Dense
from keras.models import Model

# A single hidden layer on 2581-dimensional input vectors.
input_layer = Input(shape=(2581,), dtype='float32')
hidden_layer = Dense(512, activation='relu')(input_layer)
output_layer = Dense(1, activation='sigmoid')(hidden_layer)

model = Model(inputs=[input_layer], outputs=[output_layer])
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

fit works great...

Since this small data set easily fits into memory, here is how I call fit directly on the model created above; train_idx holds the indexes of the training set, and valid_idx the indexes of the validation set:

model.fit(features[train_idx], labels[train_idx],
          batch_size=128, epochs=5,
          shuffle=True,
          validation_data=(features[valid_idx], labels[valid_idx]))

Here's the val_acc I get with fit:

Epoch 1/5
58847/58847 [==============================] - 4s 70us/step - loss: 0.4075 - acc: 0.8334 - val_loss: 0.3259 - val_acc: 0.8828
Epoch 2/5
58847/58847 [==============================] - 4s 61us/step - loss: 0.2757 - acc: 0.8960 - val_loss: 0.2686 - val_acc: 0.9039
Epoch 3/5
58847/58847 [==============================] - 4s 61us/step - loss: 0.2219 - acc: 0.9212 - val_loss: 0.2162 - val_acc: 0.9227
Epoch 4/5
58847/58847 [==============================] - 4s 61us/step - loss: 0.1855 - acc: 0.9353 - val_loss: 0.1992 - val_acc: 0.9314
Epoch 5/5
58847/58847 [==============================] - 4s 60us/step - loss: 0.1583 - acc: 0.9456 - val_loss: 0.1763 - val_acc: 0.9390

... but fit_generator doesn't

I would expect the same results with fit_generator:

model.fit_generator(generate_data(hdf5_file, train_idx, batch_size),
                    steps_per_epoch=len(train_idx) // batch_size,
                    epochs=5,
                    shuffle=False,
                    validation_data=generate_data(hdf5_file, valid_idx, batch_size),
                    validation_steps=len(valid_idx) // batch_size)

What I get is the same val_acc for every epoch, as if only one class were being predicted constantly:

Epoch 1/5
460/460 [==============================] - 8s 17ms/step - loss: 0.3230 - acc: 0.9447 - val_loss: 6.9277 - val_acc: 0.4941
Epoch 2/5
460/460 [==============================] - 6s 14ms/step - loss: 0.9536 - acc: 0.8627 - val_loss: 7.1385 - val_acc: 0.4941
Epoch 3/5
460/460 [==============================] - 6s 14ms/step - loss: 0.8764 - acc: 0.8839 - val_loss: 7.0521 - val_acc: 0.4941
Epoch 4/5
460/460 [==============================] - 6s 13ms/step - loss: 0.9005 - acc: 0.8885 - val_loss: 7.0459 - val_acc: 0.4941
Epoch 5/5
460/460 [==============================] - 6s 14ms/step - loss: 0.9259 - acc: 0.8907 - val_loss: 7.0880 - val_acc: 0.4941

The generator method

The last piece of the puzzle: the generator. Here, n_parts is the number of parts the HDF5 file is split into for loading. From the currently loaded part of the file, I keep only the rows that fall among the selected indexes. The kept features (partial_features) and labels (partial_labels) are the rows at indexes partial_indexes in the HDF5 file.

import random

import h5py
import numpy as np


def generate_data(hdf5_file, indexes, batch_size, n_parts=4):
    part = 0
    with h5py.File(hdf5_file, 'r') as h5:
        dset = h5.get('features')
        part_size = dset.shape[0] // n_parts

    while True:
        # Load the current part of the file into memory.
        with h5py.File(hdf5_file, 'r') as h5:
            dset = h5.get('features')
            dset_start = part * part_size
            dset_end = (part + 1) * part_size if part < n_parts - 1 else dset.shape[0]
            partial_features = dset[dset_start:dset_end, :-1]
            partial_labels = dset[dset_start:dset_end, -1]

        # Keep only the selected indexes that fall inside this part.
        partial_indexes = np.asarray(
            [index for index in indexes if dset_start <= index < dset_end])

        offset = part * part_size
        part = part + 1 if part < n_parts - 1 else 0
        if not len(partial_indexes):
            continue

        partial_features = partial_features[partial_indexes - offset]
        partial_labels = partial_labels[partial_indexes - offset]

        # Yield the kept rows in batches, in random batch order.
        batch_indexes = list(range(0, len(partial_features), batch_size))
        random.shuffle(batch_indexes)
        for idx in batch_indexes:
            yield np.asarray(partial_features[idx:idx + batch_size, :]), \
                  np.asarray(partial_labels[idx:idx + batch_size])

I tried shuffling the training set only, the validation set only, and both, and I tried these combinations with both shuffle=True and shuffle=False in fit_generator. Apart from the fact that val_acc may change a bit, it is still essentially ~0.5 with fit_generator and the like, and ~0.9 with fit.

Do you see anything wrong with my approach? With my generator? Any help is appreciated!

I've been stuck on this problem for 10 days now. Alternatively, what other option (Keras or other library) do I have to train a model on a data set that does not fit into memory?

Upvotes: 0

Views: 1071

Answers (1)

user3638629

Reputation: 145

I finally figured this out, and I'm posting my findings for future reference in case somebody else stumbles upon a similar issue: the generators were not the problem; the order of the samples in the HDF5 file was.

The model is used for a binary classification problem, where labels in the data set are either zeros or ones. The trouble is that the HDF5 file initially contained all the samples labeled with 1, followed by all the samples labeled with 0 (where the number of positive and negative samples is roughly the same). This means that when the generator function splits the HDF5 file into 4 parts, the first two parts only contain positive samples and the last two parts only contain negative samples.

This can be fixed by writing the samples to the HDF5 file in random order, so that any contiguous portion of the file contains roughly the same number of positive and negative samples. This way, the model is presented with positive and negative data in roughly equal proportion at any given time during training.
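
For illustration, one way to do this is sketched below; data is a placeholder for an in-memory array holding the features with the label in the last column, and the file and dataset names are arbitrary:

import h5py
import numpy as np

# Placeholder source: features with the label in the last column.
data = np.random.rand(10000, 2582).astype('float32')

# Shuffle the row order once, then write rows in that order so that any
# contiguous slice of the file mixes positive and negative samples.
order = np.random.permutation(len(data))
with h5py.File('shuffled.h5', 'w') as h5:
    dset = h5.create_dataset('features', shape=data.shape, dtype='float32')
    chunk = 1000
    for start in range(0, len(order), chunk):
        idx = order[start:start + chunk]
        dset[start:start + len(idx), :] = data[idx]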

Upvotes: 1
