I have a data set for a binary classification problem in which both classes are equally represented. Since the data set does not fit into memory (4 million data points), I store it as an HDF5 file that is read and fed incrementally into a simple Keras model via fit_generator. The problem is that I'm getting low validation accuracy with fit_generator, whereas everything is OK if I simply use fit. The full data set does not fit into memory, but for debugging purposes and for the rest of this post I only use 100k of the 4M data points.
Since the aim is to do stratified 10-fold CV on the full data set, I manually partition the data set indexes into indexes for the training, validation, and evaluation sets. I call fit_generator with a generator function yielding batches of training (or validation) samples and labels covering the specified indexes from the first quarter of the HDF5 file, then from the second quarter, etc.
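For reference, a stratified partition of this kind might look like the sketch below. It is a single split rather than the full 10-fold procedure. The function name stratified_split, the 60/15/25 fractions (chosen to match the sizes quoted later in the post), and the toy labels are all assumptions made for illustration.

```python
import numpy as np

def stratified_split(labels, fractions=(0.6, 0.15, 0.25), seed=0):
    """Split row indexes into train/valid/eval sets, class by class."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    parts = [[] for _ in fractions]
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        rng.shuffle(cls_idx)
        # Cumulative cut points for this class; the last set takes the rest.
        cuts = (np.cumsum(fractions[:-1]) * len(cls_idx)).astype(int)
        for part, chunk in zip(parts, np.split(cls_idx, cuts)):
            part.append(chunk)
    # Sort each set so the indexes are in file order.
    return [np.sort(np.concatenate(p)) for p in parts]

labels = np.array([1] * 50 + [0] * 50)  # toy labels, not the real data
train_idx, valid_idx, eval_idx = stratified_split(labels)
```

Because each class is cut at the same relative positions, every set keeps the class proportions of the whole data set.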
I know the validation part of fit_generator uses test_on_batch under the hood, as does evaluate_generator. I also tried a solution using the train_on_batch and test_on_batch approach, but with the same result: validation accuracy is low with fit_generator and the like, but high with fit if the data set is loaded into memory all at once. The model is the same in both cases (fit vs fit_generator).
My debugging data set has ~100k samples and labels (~50k in class 0 and ~50k in class 1). Training and validation are performed on 75% of the data (roughly 60k samples for training and 15k for validation). The two classes are equally distributed among the training and validation samples.
Here is the very simple model I use:
from keras.layers import Dense, Input
from keras.models import Model

input_layer = Input(shape=(2581,), dtype='float32')
# No input_shape on Dense: the Input layer above already fixes the shape.
hidden_layer = Dense(512, activation='relu')(input_layer)
output_layer = Dense(1, activation='sigmoid')(hidden_layer)
model = Model(inputs=[input_layer], outputs=[output_layer])
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
fit works great...
Since this small data set easily fits into memory, here is how I use fit directly with the model created above; train_idx are the indexes for the training set, and valid_idx are the indexes for the validation set:
model.fit(features[train_idx], labels[train_idx],
          batch_size=128, epochs=5,
          shuffle=True,
          validation_data=(features[valid_idx], labels[valid_idx]))
Here's the val_acc I get with fit:
Epoch 1/5
58847/58847 [==============================] - 4s 70us/step - loss: 0.4075 - acc: 0.8334 - val_loss: 0.3259 - val_acc: 0.8828
Epoch 2/5
58847/58847 [==============================] - 4s 61us/step - loss: 0.2757 - acc: 0.8960 - val_loss: 0.2686 - val_acc: 0.9039
Epoch 3/5
58847/58847 [==============================] - 4s 61us/step - loss: 0.2219 - acc: 0.9212 - val_loss: 0.2162 - val_acc: 0.9227
Epoch 4/5
58847/58847 [==============================] - 4s 61us/step - loss: 0.1855 - acc: 0.9353 - val_loss: 0.1992 - val_acc: 0.9314
Epoch 5/5
58847/58847 [==============================] - 4s 60us/step - loss: 0.1583 - acc: 0.9456 - val_loss: 0.1763 - val_acc: 0.9390
fit_generator doesn't
I would expect the same results with fit_generator:
model.fit_generator(generate_data(hdf5_file, train_idx, batch_size),
                    steps_per_epoch=len(train_idx) // batch_size,
                    epochs=5,
                    shuffle=False,
                    validation_data=generate_data(hdf5_file, valid_idx, batch_size),
                    validation_steps=len(valid_idx) // batch_size)
What I get is the same val_acc for every epoch, as if only one class was predicted constantly:
Epoch 1/5
460/460 [==============================] - 8s 17ms/step - loss: 0.3230 - acc: 0.9447 - val_loss: 6.9277 - val_acc: 0.4941
Epoch 2/5
460/460 [==============================] - 6s 14ms/step - loss: 0.9536 - acc: 0.8627 - val_loss: 7.1385 - val_acc: 0.4941
Epoch 3/5
460/460 [==============================] - 6s 14ms/step - loss: 0.8764 - acc: 0.8839 - val_loss: 7.0521 - val_acc: 0.4941
Epoch 4/5
460/460 [==============================] - 6s 13ms/step - loss: 0.9005 - acc: 0.8885 - val_loss: 7.0459 - val_acc: 0.4941
Epoch 5/5
460/460 [==============================] - 6s 14ms/step - loss: 0.9259 - acc: 0.8907 - val_loss: 7.0880 - val_acc: 0.4941
Note that:
- The generate_data generator is used for both training and validation.
- fit_generator is called with shuffle=False because it is the generator that handles shuffling (in any case, specifying shuffle=True does not change val_acc).
Last piece of the puzzle: the generator. Here, n_parts is the number of parts that the HDF5 file is split into for loading. I then keep only the rows, in the currently loaded part of the HDF5 file, that actually fall among the selected indexes. The kept features (partial_features) and labels (partial_labels) are the rows at indexes partial_indexes in the HDF5 file.
import random

import h5py
import numpy as np

def generate_data(hdf5_file, indexes, batch_size, n_parts=4):
    part = 0
    # Work out how many rows each contiguous part of the file holds.
    with h5py.File(hdf5_file, 'r') as h5:
        dset = h5.get('features')
        part_size = dset.shape[0] // n_parts
    while True:
        # Load the current part: the last column is the label, the rest are features.
        with h5py.File(hdf5_file, 'r') as h5:
            dset = h5.get('features')
            dset_start = part * part_size
            dset_end = (part + 1) * part_size if part < n_parts - 1 else dset.shape[0]
            partial_features = dset[dset_start:dset_end, :-1]
            partial_labels = dset[dset_start:dset_end, -1]
        # Keep only the requested indexes that fall inside this part.
        partial_indexes = list()
        for index in indexes:
            if dset_start <= index < dset_end:
                partial_indexes.append(index)
        partial_indexes = np.asarray(partial_indexes)
        offset = part * part_size
        # Move on to the next part, wrapping around after the last one.
        part = part + 1 if part < n_parts - 1 else 0
        if not len(partial_indexes):
            continue
        partial_features = partial_features[partial_indexes - offset]
        partial_labels = partial_labels[partial_indexes - offset]
        # Shuffle the order of the batches (not the rows within a batch).
        batch_indexes = [idx for idx in range(0, len(partial_features), batch_size)]
        random.shuffle(batch_indexes)
        for idx in batch_indexes:
            yield np.asarray(partial_features[idx:idx + batch_size, :]), \
                  np.asarray(partial_labels[idx:idx + batch_size])
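As a side note, the index-selection loop in the generator above could probably be written with a NumPy boolean mask instead of a Python loop. The sketch below is only an illustration; the helper name indexes_in_part is hypothetical, and it selects the same indexes without changing which batches are produced.

```python
import numpy as np

def indexes_in_part(indexes, dset_start, dset_end):
    """Return the entries of `indexes` that fall in [dset_start, dset_end)."""
    indexes = np.asarray(indexes)
    mask = (indexes >= dset_start) & (indexes < dset_end)
    return indexes[mask]

selected = indexes_in_part([3, 27, 49, 50, 99], dset_start=25, dset_end=50)
print(selected)  # → [27 49]
```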
I did try shuffling for the training set only, the validation set only, and both. I did try these combinations with shuffle=True and shuffle=False in fit_generator. Apart from the fact that val_acc may change a bit, it's still essentially at ~0.5 if I use fit_generator, and ~0.9 if I use fit.
Do you see anything wrong with my approach? With my generator? Any help is appreciated!
I've been stuck on this problem for 10 days now. Alternatively, what other option (Keras or other library) do I have to train a model on a data set that does not fit into memory?
Upvotes: 0
Views: 1071
Reputation: 145
I finally figured this out and I'll be posting my findings for future reference in case somebody else stumbles upon a similar issue: the generators were not the problem, but the order of the samples in the HDF5 file was.
The model is used for a binary classification problem, where labels in the data set are either zeros or ones. The trouble is that the HDF5 file initially contained all the samples labeled with 1, followed by all the samples labeled with 0 (where the number of positive and negative samples is roughly the same). This means that when the generator function splits the HDF5 file into 4 parts, the first two parts only contain positive samples and the last two parts only contain negative samples.
This can be fixed if samples are written in random order to the HDF5 file such that any contiguous portion of the file roughly contains the same amount of positive and negative samples. This way the model is presented with positive and negative data in roughly equal proportion at any given time during training.
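One possible way to rewrite the file in random order without loading it all into memory is sketched below. It is not the exact code I used. The names write_shuffled and shuffle_hdf5_file, the chunk size, and the seed are assumptions; the only things taken from the question are the 'features' dataset name and the label being in the last column. Reading scattered rows from HDF5 can be slow, so treat this as a starting point.

```python
import numpy as np

def write_shuffled(src, dst, chunk_rows=10000, seed=0):
    """Copy the rows of `src` into `dst` in a random order, chunk by chunk.

    `src` and `dst` are assumed to be 2-D, of the same shape, and to support
    NumPy-style indexing (NumPy arrays or h5py datasets).
    """
    n_rows = src.shape[0]
    perm = np.random.default_rng(seed).permutation(n_rows)
    for start in range(0, n_rows, chunk_rows):
        wanted = perm[start:start + chunk_rows]
        # h5py only accepts index lists in increasing order, so read the
        # rows sorted and then put them back into the permuted order.
        order = np.argsort(wanted)
        rows = src[wanted[order], :]
        shuffled = np.empty_like(rows)
        shuffled[order] = rows
        dst[start:start + len(wanted), :] = shuffled
    return perm

def shuffle_hdf5_file(src_path, dst_path, name='features'):
    """Hypothetical wrapper: write a shuffled copy of one dataset."""
    import h5py  # imported here so write_shuffled can be tried without it
    with h5py.File(src_path, 'r') as fin, h5py.File(dst_path, 'w') as fout:
        src = fin[name]
        dst = fout.create_dataset(name, shape=src.shape, dtype=src.dtype)
        write_shuffled(src, dst)

# Toy check with in-memory arrays: a feature in column 0, the label in the last column.
toy = np.column_stack([np.arange(100), [1] * 50 + [0] * 50])
out = np.empty_like(toy)
write_shuffled(toy, out, chunk_rows=16)
```

Each row is moved as a whole, so features and labels stay paired, and any contiguous part of the new file should hold a mix of both classes.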
Upvotes: 1