Reputation: 12404
TensorFlow's Keras Model.fit method has two parameters that limit the number of steps during a training epoch: steps_per_epoch for the number of training steps and validation_steps for the number of validation steps. However, a major difference between these two arguments (besides one being for training and the other for validation) is how they sample from the dataset. For training, steps_per_epoch uses the next available samples each epoch, so each epoch progresses further through the dataset. For validation, validation_steps always starts again from the beginning of the validation data. The reason validation_steps works differently is that the developers wanted to ensure the same data is used for each validation run.
In my case, I would prefer that validation_steps did not work this way. My dataset (both training and validation) is quite large, and I would like to validate frequently without it taking an excessive amount of time. However, I also don't want to validate on only a small, fixed validation subset. I would like each validation run to sample a different subset of the total validation set, so that the smoothed validation curve gives an overall approximation. This would be possible by shuffling the entire validation set, but again, because the dataset is very large, the file paths of all the examples cannot be loaded into memory simultaneously to be shuffled.
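For reference, a minimal fit call using both arguments might look like this (model, train_ds, and val_ds are placeholder names, not my actual objects):
history = model.fit(
    train_ds,                  # training tf.data.Dataset
    epochs=20,
    steps_per_epoch=500,       # uses the *next* 500 training batches each epoch
    validation_data=val_ds,    # validation tf.data.Dataset
    validation_steps=50,       # re-reads the *first* 50 validation batches every time
)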
Is there a way to have validation_steps work the same as steps_per_epoch, so that the data used continually progresses through the dataset on each epoch? Either through some setting of fit, or by somehow wrapping the dataset so that when fit tries to reset it, it instead samples the next elements in the dataset?
Just to clarify, my data pipeline starts with pathlib.Path.glob, which produces a generator. This generator cannot be converted to a list, as there are too many paths to fit in memory at once. The generator is used as the source of a TensorFlow Dataset. Through the Dataset API, I load the individual files and preprocess them; the API does this asynchronously from the GPU, using multiple parallel workers. The Dataset API also shuffles a small buffer of the loaded examples. This provides a steady supply of prefetched data for the GPU to train on.
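Roughly, the pipeline looks like the sketch below (the directory, file pattern, batch size, and loading function are simplified placeholders, not my real code):
import pathlib
import tensorflow as tf

DATA_DIR = pathlib.Path("/data/validation")   # placeholder directory

def path_generator():
    # pathlib.Path.glob yields paths lazily, so the full list never sits in memory
    for path in DATA_DIR.glob("**/*.npy"):
        yield str(path)

def load_and_preprocess(path):
    # placeholder: read the file and turn it into a (features, label) pair
    raw = tf.io.read_file(path)
    return raw, tf.constant(0)

val_ds = (
    tf.data.Dataset.from_generator(
        path_generator,
        output_signature=tf.TensorSpec(shape=(), dtype=tf.string),
    )
    .map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)  # parallel loading
    .shuffle(1024)                 # small in-memory shuffle buffer
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)    # keep batches ready for the GPU
)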
Upvotes: 0
Views: 158
Reputation: 86600
It depends on which kind of generator you're using.
If it's a keras.utils.Sequence (the standard Keras generator you get from ImageDataGenerator and methods like flow_from_dataframe, flow, etc.), it has a length (len() works on it) and its batches can be retrieved by index.
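To illustrate that interface, a minimal Sequence looks roughly like this (ValSequence and its arguments are illustrative names, not something from your code):
import numpy as np
from tensorflow import keras

class ValSequence(keras.utils.Sequence):
    def __init__(self, x_val, y_val, batch_size):
        self.x, self.y, self.batch_size = x_val, y_val, batch_size

    def __len__(self):
        # number of batches
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, i):
        # batch number i, fetched by index
        s = slice(i * self.batch_size, (i + 1) * self.batch_size)
        return self.x[s], self.y[s]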
So, for these you create an array of indices:
import numpy as np

batches = len(val_generator)   # a Sequence knows how many batches it has
indices = np.arange(batches)   # one index per batch
You can then create your own Python generator like:
def my_gen(val_generator):
    batches = len(val_generator)
    indices = np.arange(batches)

    while True:  # generators passed to fit must loop forever
        # shuffle the batch order for this pass
        np.random.shuffle(indices)
        # iterate the shuffled indices
        for i in indices:
            yield val_generator[i]
Fit with your new "untouchable" generator. Keras can't start it over, because that option simply doesn't exist for plain generators.
model.fit(..., validation_data=my_gen(val_generator), validation_steps=choose)
If it's already a plain Python generator (one that uses yield), you can be sure Keras is not resetting it; that isn't possible. It's a custom generator, and all you need is to shuffle your data each cycle, just like above. But instead of shuffling the entire dataset, shuffle indices. It's pretty much the same as above, except that instead of indexing the Keras generator, you index the data itself.
def my_gen(dataX, dataY, batch_size):
    samples = len(dataX)
    indices = np.arange(samples)

    # number of batches, rounding up for a final partial batch
    batches = samples // batch_size
    if samples % batch_size > 0:
        batches += 1

    while True:
        # reshuffle the sample order each pass through the data
        np.random.shuffle(indices)
        for b in range(batches):
            start = b * batch_size
            end = (b + 1) * batch_size
            batchX = dataX[indices[start:end]]
            batchY = dataY[indices[start:end]]
            yield batchX, batchY
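In your case the validation data lives in a tf.data pipeline rather than in dataX/dataY arrays, so you can't index it directly. But the same reasoning applies: Keras cannot reset a plain Python generator. One possible sketch, assuming val_ds is your preprocessed validation Dataset and with arbitrary step counts, is to keep a single persistent iterator over the repeated dataset and yield from it, so every validation run just consumes the next batches:
val_iter = iter(val_ds.repeat())    # one persistent iterator, never recreated by fit

def rolling_val_gen():
    while True:
        yield next(val_iter)        # hands out the *next* validation batches each time

model.fit(
    train_ds,
    epochs=20,
    steps_per_epoch=500,
    validation_data=rolling_val_gen(),
    validation_steps=50,            # each evaluation sees 50 new batches
)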
Upvotes: 1