model training got stuck after first epoch got completed...2nd epoch won't even start and not throwing any error too it justs stay idle

Question

screenshot showing the model training stuck at epoch 1 without throwing error

I am using google colab pro and here is my code snippet

batch_size = 32
img_height = 256
img_width = 256

train_datagen = ImageDataGenerator(rescale=1./255,
shear_range=0.2,
zoom_range=0.2,
horizontal_flip=True,
validation_split=0.2) # set validation split

train_generator = train_datagen.flow_from_directory(
data_dir,
target_size=(img_height, img_width),
batch_size=batch_size,
class_mode='categorical',
subset='training') # set as training data

validation_generator = train_datagen.flow_from_directory(
data_dir, # same directory as training data
target_size=(img_height, img_width),
batch_size=batch_size,
class_mode='categorical',
subset='validation') # set as validation data

Found 12442 images belonging to 14 classes.
Found 3104 images belonging to 14 classes.

num_classes = 14

model =Sequential()
chanDim = -1
model.add(Conv2D(16, 3, padding='same', activation='relu', input_shape=(img_height,img_width,3)))
model.add(Activation('relu'))
model.add(BatchNormalization(axis=chanDim))
model.add(MaxPooling2D(pool_size=(3, 3)))
model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization(axis=chanDim))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization(axis=chanDim))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(128, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization(axis=chanDim))
model.add(Conv2D(128, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization(axis=chanDim))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(1024))
model.add(Activation('relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

with tf.device('/device:GPU:0'):
model.summary()


Total params: 58,091,918
Trainable params: 58,089,070
Non-trainable params: 2,848

model.compile(optimizer='adam',
          loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
          metrics=['accuracy'])

checkpoint_path = "/content/drive/MyDrive/model_checkpoints"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create a callback that saves the model's weights
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                             save_weights_only=True,
                                             verbose=1)

epochs=10
history=model.fit(
     train_generator,
     steps_per_epoch = train_generator.samples // batch_size,
     validation_data = validation_generator, 
     validation_steps = validation_generator.samples // batch_size,
     epochs = epochs,
     callbacks=[cp_callback])

tensorflow version-2.4.1 keras version-2.4.0

I am using around 15k image dataset and 58k parameters for training. I used image data generator too but when try training the model it completes its first epochs but 2nd epoch won't start it gets stuck but it doesn't throw any error it just stays idle.

model training got stuck after first epoch got completed...2nd epoch won't even start and not throwing any error too it justs stay idle

Answers (1)

Related Questions

model training got stuck after first epoch got completed...2nd epoch won&#39;t even start and not throwing any error too it justs stay idle

Answers (1)

Related Questions

model training got stuck after first epoch got completed...2nd epoch won't even start and not throwing any error too it justs stay idle