steve
steve

Reputation: 69

model training got stuck after first epoch got completed...2nd epoch won't even start and not throwing any error too it justs stay idle

screenshot showing the model training stuck at epoch 1 without throwing error

enter image description here

I am using google colab pro and here is my code snippet

batch_size = 32
img_height = 256
img_width = 256

train_datagen = ImageDataGenerator(rescale=1./255,
shear_range=0.2,
zoom_range=0.2,
horizontal_flip=True,
validation_split=0.2) # set validation split

train_generator = train_datagen.flow_from_directory(
data_dir,
target_size=(img_height, img_width),
batch_size=batch_size,
class_mode='categorical',
subset='training') # set as training data

validation_generator = train_datagen.flow_from_directory(
data_dir, # same directory as training data
target_size=(img_height, img_width),
batch_size=batch_size,
class_mode='categorical',
subset='validation') # set as validation data

Found 12442 images belonging to 14 classes.
Found 3104 images belonging to 14 classes.

num_classes = 14

model =Sequential()
chanDim = -1
model.add(Conv2D(16, 3, padding='same', activation='relu', input_shape=(img_height,img_width,3)))
model.add(Activation('relu'))
model.add(BatchNormalization(axis=chanDim))
model.add(MaxPooling2D(pool_size=(3, 3)))
model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization(axis=chanDim))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization(axis=chanDim))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(128, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization(axis=chanDim))
model.add(Conv2D(128, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization(axis=chanDim))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(1024))
model.add(Activation('relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

with tf.device('/device:GPU:0'):
model.summary()


Total params: 58,091,918
Trainable params: 58,089,070
Non-trainable params: 2,848

model.compile(optimizer='adam',
          loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
          metrics=['accuracy'])

checkpoint_path = "/content/drive/MyDrive/model_checkpoints"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create a callback that saves the model's weights
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                             save_weights_only=True,
                                             verbose=1)

epochs=10
history=model.fit(
     train_generator,
     steps_per_epoch = train_generator.samples // batch_size,
     validation_data = validation_generator, 
     validation_steps = validation_generator.samples // batch_size,
     epochs = epochs,
     callbacks=[cp_callback])

tensorflow version-2.4.1 keras version-2.4.0

I am using around 15k image dataset and 58k parameters for training. I used image data generator too but when try training the model it completes its first epochs but 2nd epoch won't start it gets stuck but it doesn't throw any error it just stays idle.

Upvotes: 2

Views: 3313

Answers (1)

steve
steve

Reputation: 69

I found that because of the large dataset and 60k params the validation set took so long in model training at first epoch because of default verbose I didn't saw that... so what I did is that I reduced my image size from 260 260 to 180180 which reduced my params to 29 k from 60k and trained my model again but this time I waited for 30 mins for the enter image description herevalidation set (which I can't see the info because of verbose 1 default)) after the training set is completed. In the image attached you can see it says 5389 secs (89 mins) for first epochs but its only training dataset time it didn't add up validation time which took about 30 mins for it...so if u see ur model stuck after the training dataset ..just wait because validation data will be executed.....or use verbose =2

Upvotes: 3

Related Questions