golmschenk

Reputation: 12404

Use newly sampled validation examples with TensorFlow Keras fit when using `validation_steps`?

TensorFlow's Keras `Model.fit` method has two parameters that limit the number of steps during a training epoch: `steps_per_epoch` for the number of training steps and `validation_steps` for the number of validation steps. However, a major difference between these two arguments (besides being for training or validation) is how they sample from the dataset. For training, `steps_per_epoch` uses the next available samples each epoch, so each epoch progresses further through the dataset. For validation, `validation_steps` always starts again from the beginning of the validation data. The reason `validation_steps` works differently is that the developers wanted to ensure the same data is used for each validation run.

In my case, I would prefer this wasn't how `validation_steps` worked. My dataset (both training and validation) is quite large. I would like to validate frequently, without validation taking an excessive amount of time, but I also don't want to validate on only a limited portion of the validation dataset. I would like each validation run to sample a random subset of the total dataset, so that the smoothed validation curve gives an overall approximation. This would be possible by shuffling the entire validation set, but again, because the dataset is very large, all the file paths of the examples cannot be loaded into memory simultaneously to be shuffled.

Is there a way to have `validation_steps` work the same as `steps_per_epoch`, so that the data used continually progresses through the dataset on each epoch? Either through some setting of `fit`, or by wrapping the dataset in such a way that when `fit` tries to reset it, it instead samples the next elements in the dataset?

Just to clarify, my data pipeline starts with `pathlib.Path.glob`. This produces a generator, which cannot be converted to a list, as there are too many paths to fit in memory at once. The generator is used as the source of a TensorFlow `Dataset`. Through the Dataset API, I load the individual files and preprocess them; the API does this asynchronously from the GPU work, using multiple parallel workers. The Dataset API also shuffles a small buffer of the loaded examples. This provides a steady supply of prefetched data for the GPU to train on.
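For reference, here is a minimal sketch of that pipeline (the path pattern, the `load_and_preprocess` placeholder, and the buffer sizes are illustrative stand-ins, not my actual code):

import pathlib

import tensorflow as tf


def example_paths():
    # pathlib's glob is a lazy generator, so the paths never need
    # to fit in memory all at once.
    for path in pathlib.Path('data').glob('**/*.example'):
        yield str(path)


def load_and_preprocess(path):
    # Placeholder: the real version loads and preprocesses one file.
    return tf.io.read_file(path)


dataset = tf.data.Dataset.from_generator(example_paths, output_types=tf.string)
dataset = dataset.map(load_and_preprocess, num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.shuffle(buffer_size=1000)  # shuffles a small buffer, not the whole dataset
dataset = dataset.batch(32)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)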

Upvotes: 0

Views: 158

Answers (1)

Daniel Möller

Reputation: 86600

It depends on which kind of generator you're using.

If it's a keras.utils.Sequence (the standard Keras generator that you get from ImageDataGenerator and methods like flow_from_dataframe, flow, etc.), it implements __len__ and __getitem__, so you can ask how many batches it holds and fetch any batch by index.
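For context, a Sequence is just a class that implements that interface. A minimal sketch (class and variable names are placeholders):

from tensorflow import keras
import numpy as np


class ValSequence(keras.utils.Sequence):
    def __init__(self, x, y, batch_size):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, index):
        # Return batch number `index`.
        batch = slice(index * self.batch_size, (index + 1) * self.batch_size)
        return self.x[batch], self.y[batch]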

So, for these you create an array of batch indices:

import numpy as np

batches = len(val_generator)
indices = np.arange(batches)

You can then create your own Python generator like this:

def my_gen(val_generator):
    batches = len(val_generator)
    indices = np.arange(batches)

    # A plain Python generator passed to fit must yield forever.
    while True:
        # Shuffle the batch order for each pass through the data.
        np.random.shuffle(indices)

        # Yield every batch once, in the shuffled order.
        for i in indices:
            yield val_generator[i]

Fit with your new "untouchable" generator. Keras can't restart it, because a plain Python generator offers no way to be reset.

model.fit(..., validation_data=my_gen(val_generator), validation_steps=choose)

If it's already a plain Python generator (one that uses yield), you can be sure Keras is not resetting it; that isn't possible. It's a custom generator, and all you need is to shuffle your data each cycle, just like above. But instead of shuffling the entire dataset, shuffle indices. This is pretty much the same as what I did above, except that instead of indexing into a Keras generator, you index into the data itself.

import numpy as np

def my_gen(dataX, dataY, batch_size):
    samples = len(dataX)
    indices = np.arange(samples)

    # Round up so a final partial batch is included.
    batches = samples // batch_size
    if samples % batch_size > 0:
        batches += 1

    while True:
        # Shuffle the sample order for each pass through the data.
        np.random.shuffle(indices)

        for b in range(batches):
            start = b * batch_size
            end = (b + 1) * batch_size

            # Fancy indexing gathers the shuffled samples for this batch.
            batchX = dataX[indices[start:end]]
            batchY = dataY[indices[start:end]]
            yield batchX, batchY
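For example, with placeholder validation arrays x_val and y_val (the names are illustrative), it is used just like before:

model.fit(..., validation_data=my_gen(x_val, y_val, 32), validation_steps=20)

Because the generator never ends, fit keeps drawing the next validation_steps batches on every epoch, progressing through a freshly shuffled pass of the data rather than restarting from the beginning.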

Upvotes: 1
