Reputation: 95
I've noticed a tremendous slowdown in model training speed when I specify the steps_per_epoch argument in the model.fit(...) method. When I leave steps_per_epoch as None (or don't pass it at all), the epoch's ETA is a steady 2 seconds:
9120/60000 [===>..........................] - ETA: 2s - loss: 0.7055 - acc: 0.7535
When I add the steps_per_epoch argument, the ETA jumps to over 5 hours and training becomes extremely slow:
5/60000 [..............................] - ETA: 5:50:00 - loss: 1.9749 - acc: 0.3437
Here is the reproducible script:
import tensorflow as tf
from tensorflow import keras
import time
print(tf.__version__)
def get_model():
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(128, activation='relu'),
        keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
(train_images, train_labels), (test_images, test_labels) = keras.datasets.fashion_mnist.load_data()
train_images = train_images / 255.0
model = get_model()
# Very quick - 2 seconds
start = time.time()
model.fit(train_images, train_labels, epochs=1)
end = time.time()
print("{} seconds", end - start)
model = get_model()
# Very slow - 5 hours
start = time.time()
model.fit(train_images, train_labels, epochs=1, steps_per_epoch=len(train_images))
end = time.time()
print("{} seconds", end - start)
I've also tried with standalone Keras and the problem persisted. I'm using TensorFlow 1.12.0, Python 3, and Ubuntu 18.04.1 LTS.
Why does the steps_per_epoch argument cause such a significant slowdown, and how can I avoid it?
Thanks!
Upvotes: 3
Views: 12898
Reputation: 86600
Notice that you're calling fit with an array of data, not fit_generator with a generator.
There is no point in passing steps_per_epoch here unless you have some unconventional use case.
The default batch size in fit is 32, which means you're training with 60000 // 32 = 1875 steps per epoch.
If you pass steps_per_epoch=1875, you train the same number of batches as with the default None. If you pass 60000 steps, you're multiplying the work of one epoch by 32. (Given the huge difference in speed, I would say the effective batch size also changes in this case.)
The total number shown in the output for fitting without steps is the total number of images. Notice how the number of completed items grows in multiples of 32.
The total number shown when you use steps is the number of steps. Notice how the number of completed steps grows one by one.
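For array inputs, the simplest fix is to drop steps_per_epoch entirely and let fit slice the data itself. A minimal sketch (reusing get_model(), train_images and train_labels from the question's script; the explicit batch_size=32 just makes the default visible):
from tensorflow import keras

# Reuses get_model(), train_images and train_labels from the question's script
model = get_model()

batch_size = 32                          # fit's default batch size
print(len(train_images) // batch_size)   # 60000 // 32 = 1875 steps per epoch

# No steps_per_epoch: fit slices the arrays into 1875 batches of 32 on its own
model.fit(train_images, train_labels, epochs=1, batch_size=batch_size)
This trains the same 1875 batches per epoch as your fast run, just with the batch size spelled out.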
Upvotes: 5