Jean Quezada PAT

Reputation: 21

ResourceExhaustedError: OOM when allocating tensor with shape[32,512,64,64] GPU Tesla P4

I'm trying to run a ResNet50 model from the Keras API and use transfer learning for classification on Google Cloud Platform servers, but it gives me the following error:

ResourceExhaustedError: OOM when allocating tensor with shape [32,512,64,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node resnet50v2_20210623-124510/conv3_block2_3_conv/Conv2D (defined at <ipython-input-8-35f36c5d8b4c>:8)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_train_function_11151]

The input to my network is 512x512x3 with a batch size of 32, and the output is 9 classes. The network is adapted with this code:

import tensorflow as tf
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model

# input tensor matching the 512x512x3 images described above
input_tensor = tf.keras.Input(shape=(512, 512, 3))

base_model = tf.keras.applications.ResNet50V2(
    include_top=False,
    weights='imagenet',
    input_tensor=input_tensor,
    input_shape=None,
    pooling='avg',
    classes=9,
    classifier_activation="softmax",
)
base_model.trainable = True

# add new classifier layers on top of the pooled ResNet features
flat = Flatten()(base_model.layers[-1].output)
out_class = Dense(1024, activation='relu')(flat)
output = Dense(9, activation='softmax')(out_class)

model = Model(inputs=base_model.inputs, outputs=output)
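
For context, training is invoked roughly like this (a sketch; train_ds stands in for my actual tf.data pipeline of image/label pairs, which is not shown here):

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-4),
    loss='categorical_crossentropy',
    metrics=['accuracy'],
)
# 512x512x3 images fed in batches of 32, as described above
model.fit(train_ds.batch(32), epochs=10)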

I use an NVIDIA Tesla P4 GPU.

Please help me understand the error. I have tried different GCP VMs and it gives me the same error.

Upvotes: 1

Views: 2252

Answers (1)

PjoterS

Reputation: 14084

Posting this Community Wiki for better visibility.

OOM stands for Out Of Memory: your GPU (a Tesla P4 in your case) runs out of memory and TensorFlow cannot allocate this tensor.

There are a few things which might resolve this issue:

  1. Reduce the input or batch size (this worked for OP); see the sketch after this list.

The reported shape contains four values - [32,512,64,64]. The first (32) is the batch_size, the second (512) is the number of convolution kernels in that layer, and the third and fourth are the height and width of the feature map (both 64).

The batch_size has to be tuned to the available GPU memory. For example, comparing the Tesla P4 and the Tesla V100 (both available in GCP), the V100 allows a bigger batch than the P4.

  2. Reduce the number of layers.

  3. Reduce the size of the images (tf.image.resize can be used for that).

  4. Check which process is using your GPU. Since OP is using an NVIDIA GPU, this can be checked with nvidia-smi, or with the command ps -fA | grep python. This shows which processes are running and which one is consuming the GPU; you can then kill that process with kill -9 <PID> and rerun the training.
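
Below is a minimal sketch combining options 1 and 3, assuming the images come through a tf.data pipeline; the dataset name train_ds, the reduced batch size of 8, and the 256x256 target size are illustrative, not taken from the question:

import tensorflow as tf

BATCH_SIZE = 8            # reduced from 32; tune to what the P4's memory allows
TARGET_SIZE = (256, 256)  # reduced from 512x512

def shrink(image, label):
    # Downscaling the images shrinks every intermediate activation in the network
    return tf.image.resize(image, TARGET_SIZE), label

# train_ds is a placeholder tf.data.Dataset of (image, label) pairs
small_ds = (
    train_ds
    .map(shrink, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)
)

# Note: if the image size changes, the model's Input tensor has to match,
# e.g. tf.keras.Input(shape=(256, 256, 3)) in the question's code.
model.fit(small_ds, epochs=10)

Trying the batch size alone first, halving it until the OOM disappears, is usually the quickest check.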

Upvotes: 1
