Stefan Radonjic

Reputation: 1598

ResourceExhaustedError when trying to train ResNet on Google Colab

I am trying to train ResNet56 on Google Colab on a custom dataset where each image has dimensions 299x299x1. Here is the error I am getting:

ResourceExhaustedError:  OOM when allocating tensor with shape[32,16,299,299] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node resnet/conv2d_21/Conv2D (defined at <ipython-input-15-3b824ba8fe2a>:3) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_train_function_21542]

Function call stack:
train_function

And here is my model configuration:

TRAINING_SIZE = 9287
VALIDATION_SIZE = 1194

AUTO = tf.data.experimental.AUTOTUNE # used in tf.data.Dataset API
BATCH_SIZE = 32

model_checkpoint_path = "/content/drive/My Drive/Patch Classifier/Data/patch_classifier_checkpoint"
if not os.path.exists(model_checkpoint_path):
    os.mkdir(model_checkpoint_path)

CALLBACKS = [
              EpochCheckpoint(model_checkpoint_path, every=2, startAt=0),
              TrainingMonitor("/content/drive/My Drive/Patch Classifier/Training/resnet56.png",
                              jsonPath="/content/drive/My Drive/Patch Classifier/Training/resnet56",
                              startAt=0)
              ]

compute_steps_per_epoch = lambda x: int(math.ceil(1. * x / BATCH_SIZE))
steps_per_epoch = compute_steps_per_epoch(TRAINING_SIZE)
val_steps = compute_steps_per_epoch(VALIDATION_SIZE)

opt = SGD(lr=1e-1)
model = ResNet.build(299, 299, 1, 5, (9, 9, 9), (64, 64, 128, 256), reg=0.005)
model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])

history = model.fit(get_batched_dataset("/content/drive/My Drive/Patch Classifier/Data/patch_classifier_train_0.tfrecords"), steps_per_epoch=steps_per_epoch, epochs=10,
                    validation_data=get_batched_dataset("/content/drive/My Drive/Patch Classifier/Data/patch_classifier_val_0.tfrecords"), validation_steps=val_steps,
                    callbacks=CALLBACKS)

Any thoughts?

Upvotes: 0

Views: 981

Answers (2)

anam fariya

Reputation: 1

I was also getting the same error. It happens because of a large image size or a large batch size. I was using an image size of 512x512 and a batch size of 10; I reduced the batch size to 2 and it started working for me.
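
Applied to the question's code that is a one-line change; the value below is only a sketch, since the largest batch that fits depends on the GPU:

# The OOM tensor in the traceback, shape [32, 16, 299, 299] float32,
# is about 183 MB on its own; every halving of the batch size halves it.
BATCH_SIZE = 8  # was 32; try 16, 8, ... until the OOM error disappears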

Upvotes: 0

Natthaphon Hongcharoen

Reputation: 2430

There are not many things you can do if you run out of GPU memory.

What I can think of is either:

  1. Reduce BATCH_SIZE
  2. Reduce the image input size (see the resize sketch below).
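
For option 2, the resize belongs in the tf.data pipeline, before batching. A minimal sketch, assuming a hypothetical target size of 224 and that get_batched_dataset maps over (image, label) pairs:

import tensorflow as tf

TARGET_SIZE = 224  # hypothetical; anything smaller than 299 saves memory

def resize_image(image, label):
    # Shrink each 299x299x1 image before it is batched
    return tf.image.resize(image, [TARGET_SIZE, TARGET_SIZE]), label

# Inside get_batched_dataset, before .batch(BATCH_SIZE):
# dataset = dataset.map(resize_image, num_parallel_calls=AUTO)

The first two arguments to ResNet.build would then have to change to match (224, 224, 1, ...).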

If you choose to reduce the batch size then you might need to reduce the learning rate too, if you feel like it isn't converging.
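
A common heuristic for picking the new rate (my assumption, not something prescribed here) is the linear scaling rule: scale the learning rate by the same factor as the batch size. A sketch, assuming lr=1e-1 was tuned for the original BATCH_SIZE of 32:

from tensorflow.keras.optimizers import SGD

BASE_LR = 1e-1   # learning rate tuned for the original batch size
BASE_BATCH = 32
BATCH_SIZE = 8   # reduced to fit in GPU memory

# Linear scaling rule: shrink the learning rate proportionally
opt = SGD(learning_rate=BASE_LR * BATCH_SIZE / BASE_BATCH)  # 0.025 here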

P.S.: SGD does a lot better if you give it momentum, e.g. SGD(lr=1e-1, momentum=0.9).
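
In the question's compile step that would look like this (lr is a deprecated alias for learning_rate in recent tf.keras, so the sketch uses the newer name):

from tensorflow.keras.optimizers import SGD

# Heavy-ball momentum usually speeds up and stabilises plain SGD
opt = SGD(learning_rate=1e-1, momentum=0.9)
model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])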

Upvotes: 1
