Reputation: 69
Screenshot showing the model training stuck at epoch 1 without throwing an error.
I am using Google Colab Pro, and here is my code snippet:
# Import assumed here; the original snippet omitted it.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

batch_size = 32
img_height = 256
img_width = 256
train_datagen = ImageDataGenerator(rescale=1./255,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True,
                                   validation_split=0.2)  # set validation split
train_generator = train_datagen.flow_from_directory(
    data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical',
    subset='training')  # set as training data
validation_generator = train_datagen.flow_from_directory(
    data_dir,  # same directory as training data
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical',
    subset='validation')  # set as validation data
Found 12442 images belonging to 14 classes.
Found 3104 images belonging to 14 classes.
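# Imports assumed here; the original snippet omitted them.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Activation, BatchNormalization, Conv2D,
                                     Dense, Dropout, Flatten, MaxPooling2D)
num_classes = 14
model = Sequential()
chanDim = -1  # channels-last: normalize over the last axis
# ReLU is applied by the Activation layer below, so it is not repeated in Conv2D.
model.add(Conv2D(16, 3, padding='same', input_shape=(img_height, img_width, 3)))
model.add(Activation('relu'))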
model.add(BatchNormalization(axis=chanDim))
model.add(MaxPooling2D(pool_size=(3, 3)))
model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization(axis=chanDim))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization(axis=chanDim))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(128, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization(axis=chanDim))
model.add(Conv2D(128, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization(axis=chanDim))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(1024))
model.add(Activation('relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
with tf.device('/device:GPU:0'):
    model.summary()
Total params: 58,091,918
Trainable params: 58,089,070
Non-trainable params: 2,848
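# The model already ends in a softmax layer, so the loss receives
# probabilities rather than logits: from_logits must be False.
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])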
import os

checkpoint_path = "/content/drive/MyDrive/model_checkpoints"
checkpoint_dir = os.path.dirname(checkpoint_path)
# Create a callback that saves the model's weights
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)
epochs = 10
history = model.fit(
    train_generator,
    steps_per_epoch=train_generator.samples // batch_size,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // batch_size,
    epochs=epochs,
    callbacks=[cp_callback])
TensorFlow version: 2.4.1, Keras version: 2.4.0
I am training on a dataset of around 15k images with a model of about 58 million parameters (see the summary above), using an image data generator. When I train the model, it completes the first epoch, but the second epoch never starts: training just sits idle without throwing any error.
Upvotes: 2
Views: 3313
Reputation: 69
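I found that, because of the large dataset and the roughly 58 million parameters, the validation pass at the end of the first epoch simply took a very long time. With the default verbose=1 the progress bar gave no sign that validation was running, so I didn't notice it.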
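To get a feel for how much work that silent validation pass involves, the step counts can be worked out from the image counts the generators reported in the question. This is only back-of-the-envelope arithmetic: `estimate_validation_seconds` is a hypothetical helper, and it assumes a validation step costs about as much as a training step (plausible only when loading images dominates), which was not measured here.

```python
# Image counts reported by flow_from_directory in the question.
train_samples = 12442
val_samples = 3104
batch_size = 32

# Same integer division as in the model.fit call.
steps_per_epoch = train_samples // batch_size
validation_steps = val_samples // batch_size


def estimate_validation_seconds(train_seconds, train_steps, val_steps):
    """Rough estimate, ASSUMING a validation step costs as much as a training step."""
    return train_seconds / train_steps * val_steps


print(steps_per_epoch, validation_steps)  # 388 97

# 5389 s is the first-epoch training time quoted in this answer.
estimate = estimate_validation_seconds(5389, steps_per_epoch, validation_steps)
print(f"estimated validation time: {estimate / 60:.0f} minutes")
```

Under that assumption the estimate comes out in the low twenties of minutes, the same order of magnitude as the roughly 30 minutes observed.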
So I reduced the image size from 256×256 to 180×180, which cut the parameter count from about 58 million to about 29 million, and trained the model again. This time, after the training batches finished, I waited about 30 minutes for the validation set (whose progress I couldn't see because of the default verbose=1).
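The effect of the image size on the parameter count can be checked by hand. The sketch below recomputes the count for the architecture in the question. The helper names are mine, not Keras APIs, and it assumes 'same'-padded 3×3 convolutions, Keras' default 'valid' pooling with stride equal to the pool size, and four parameters per feature in each BatchNormalization layer.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a Conv2D layer."""
    return kernel * kernel * c_in * c_out + c_out


def bn_params(features):
    """gamma, beta, moving mean and moving variance."""
    return 4 * features


def dense_params(n_in, n_out):
    """Weights plus biases of a Dense layer."""
    return n_in * n_out + n_out


def total_params(img_size, num_classes=14):
    side = img_size
    total = conv_params(3, 3, 16) + bn_params(16)
    side //= 3  # MaxPooling2D(pool_size=(3, 3))
    total += conv_params(3, 16, 64) + bn_params(64)
    total += conv_params(3, 64, 64) + bn_params(64)
    side //= 2  # MaxPooling2D(pool_size=(2, 2))
    total += conv_params(3, 64, 128) + bn_params(128)
    total += conv_params(3, 128, 128) + bn_params(128)
    side //= 2  # MaxPooling2D(pool_size=(2, 2))
    flat = side * side * 128  # size of the Flatten output
    total += dense_params(flat, 1024) + bn_params(1024)
    total += dense_params(1024, num_classes)
    return total


print(f"{total_params(256):,}")  # 58,091,918
print(f"{total_params(180):,}")  # 29,780,366
```

For 256×256 this reproduces the total from model.summary() in the question. Almost all of the weights sit in the first Dense layer after Flatten, which is why shrinking the input roughly halves the model.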
In the attached image you can see 5389 s (about 90 minutes) reported for the first epoch, but that is only the time spent on the training data; it does not include the validation pass, which took about 30 more minutes. So if your model seems stuck after the training batches finish, just wait, because the validation data is still being evaluated, or use verbose=2.
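If waiting blind is uncomfortable, one option is a small custom callback that reports validation progress. The sketch below is an illustration rather than part of my original code: `ValidationProgress` and its `every` parameter are hypothetical names. It relies on the `on_test_begin`, `on_test_batch_end` and `on_test_end` hooks of `tf.keras.callbacks.Callback`, which Keras also calls for the validation pass inside `model.fit`.

```python
import time

import tensorflow as tf


class ValidationProgress(tf.keras.callbacks.Callback):
    """Print a line every `every` validation batches (hypothetical helper)."""

    def __init__(self, every=10):
        super().__init__()
        self.every = every
        self.batches_seen = 0
        self.start = None

    def on_test_begin(self, logs=None):
        # Called when validation (or model.evaluate) starts.
        self.batches_seen = 0
        self.start = time.time()
        print("\nValidation started...")

    def on_test_batch_end(self, batch, logs=None):
        # Called after each validation batch.
        self.batches_seen += 1
        if self.batches_seen % self.every == 0:
            elapsed = time.time() - self.start
            print(f"  validated {self.batches_seen} batches in {elapsed:.0f} s")

    def on_test_end(self, logs=None):
        elapsed = time.time() - self.start
        print(f"Validation finished after {elapsed:.0f} s")
```

It would be passed alongside the checkpoint callback, for example `callbacks=[cp_callback, ValidationProgress(every=10)]`.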
Upvotes: 3