Robert

Reputation: 313

How to fix "ResourceExhaustedError: OOM when allocating tensor"

I want to make a model with multiple inputs, so I tried to build a model like this.

# imports needed for the code below
from tensorflow.keras import layers, Model
from tensorflow.keras.layers import Input, Dense, concatenate
from tensorflow.keras.optimizers import Adam

# define two sets of inputs
inputA = Input(shape=(32,64,1))
inputB = Input(shape=(32,1024))
 
# CNN
x = layers.Conv2D(32, kernel_size = (3, 3), activation = 'relu')(inputA)
x = layers.Conv2D(32, (3,3), activation='relu')(x)
x = layers.MaxPooling2D(pool_size=(2,2))(x)
x = layers.Dropout(0.2)(x)
x = layers.Flatten()(x)
x = layers.Dense(500, activation = 'relu')(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(500, activation='relu')(x)
x = Model(inputs=inputA, outputs=x)
 
# DNN
y = layers.Flatten()(inputB)
y = Dense(64, activation="relu")(y)
y = Dense(250, activation="relu")(y)
y = Dense(500, activation="relu")(y)
y = Model(inputs=inputB, outputs=y)
 
# Combine the output of the two models
combined = concatenate([x.output, y.output])
 

# combined outputs
z = Dense(300, activation="relu")(combined)
z = Dense(100, activation="relu")(z)
z = Dense(1, activation="softmax")(z)

model = Model(inputs=[x.input, y.input], outputs=z)

model.summary()

opt = Adam(lr=1e-3, decay=1e-3 / 200)
model.compile(loss = 'sparse_categorical_crossentropy', optimizer = opt,
    metrics = ['accuracy'])

and the summary: [model.summary() output attached as an image]

But when I try to train this model,

history = model.fit([trainimage, train_product_embd],train_label,
    validation_data=([validimage,valid_product_embd],valid_label), epochs=10, 
    steps_per_epoch=100, validation_steps=10)

the following error occurs:

ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-18-2b79f16d63c0> in <module>()
----> 1 history = model.fit([trainimage, train_product_embd],train_label,
      2     validation_data=([validimage,valid_product_embd],valid_label),
      3     epochs=10, steps_per_epoch=100, validation_steps=10)

4 frames
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py in __call__(self, *args, **kwargs)
   1470         ret = tf_session.TF_SessionRunCallable(self._session._session,
   1471                                                self._handle, args,
-> 1472                                                run_metadata_ptr)
   1473         if run_metadata:
   1474             proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[800000,32,30,62] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
    [[{{node conv2d_1/convolution}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

    [[metrics/acc/Mean_1/_185]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[800000,32,30,62] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
    [[{{node conv2d_1/convolution}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations. 0 derived errors ignored.

Thanks for reading and hopefully helping me :)

Upvotes: 21

Views: 67638

Answers (6)

Abramodj

Reputation: 5879

In my case, a stale process was using most of the RAM available on the GPU.

Simple solution (for Nvidia GPUs):

  • Type nvidia-smi in a terminal and note the PID of the process(es) with large memory usage
  • Run sudo kill PID, where PID is the ID noted above

Upvotes: 0

Osama Arshad

Reputation: 1

Solutions:

  • Reduce your input dimensions, since GPU memory is limited (e.g., an Nvidia GTX 1060 has only 3 GB)
  • Reduce the batch size of datagen.flow (it defaults to 32, so try 8, 16, or 24), as in the sketch below
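
For example, a minimal sketch of lowering the batch size via datagen.flow. The arrays here are random stand-ins for the question's trainimage and train_product_embd-style data, and it assumes a single-input model for simplicity:

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

trainimage = np.random.rand(1000, 32, 64, 1).astype('float32')  # stand-in for the real images
train_label = np.random.randint(0, 2, size=(1000,))             # stand-in for the real labels

datagen = ImageDataGenerator(rescale=1.0 / 255)

# flow() defaults to batch_size=32; a smaller value cuts per-step GPU memory
train_gen = datagen.flow(trainimage, train_label, batch_size=16)

# then train on the generator, e.g.: history = model.fit(train_gen, epochs=10)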

Upvotes: 0

25_Ananya Y

Reputation: 16

I think the most common reason for this error is the absence of MaxPooling layers. Keep the same architecture, but add at least one MaxPool layer after the Conv2D layers, as in the sketch below. This may even improve the overall performance of the model. You can also try reducing the depth of the model, i.e., removing unnecessary layers.
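
For instance, a minimal sketch of the question's convolutional branch with a pooling stage after each convolution (shapes and names taken from the question):

from tensorflow.keras import layers
from tensorflow.keras.layers import Input

inputA = Input(shape=(32, 64, 1))
x = layers.Conv2D(32, (3, 3), activation='relu')(inputA)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)  # halves height/width, roughly quarters activation memory
x = layers.Conv2D(32, (3, 3), activation='relu')(x)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)
x = layers.Flatten()(x)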

Upvotes: 0

Nicolas Gervais

Reputation: 36594

OOM stands for "out of memory". Your GPU is running out of memory, so it can't allocate memory for this tensor. There are a few things you can do:

  • Decrease the number of units/filters in your Dense and Conv2D layers
  • Use a smaller batch_size (or increase steps_per_epoch and validation_steps)
  • Use grayscale images (you can use tf.image.rgb_to_grayscale)
  • Reduce the number of layers
  • Use MaxPooling2D layers after convolutional layers
  • Reduce the size of your images (you can use tf.image.resize for that; see the sketch after this list)
  • Use smaller float precision for your input, namely np.float32 instead of np.float64
  • If you're using a pre-trained model, freeze the first layers
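
For example, a minimal sketch of the grayscale/resize/float32 points, with a random tensor standing in for a real batch of RGB images shaped (N, height, width, 3):

import tensorflow as tf

images = tf.random.uniform((8, 32, 64, 3))   # stand-in for a real RGB batch
images = tf.image.rgb_to_grayscale(images)   # 3 channels -> 1
images = tf.image.resize(images, (16, 32))   # smaller spatial dimensions
images = tf.cast(images, tf.float32)         # make sure inputs aren't float64

Each of these shrinks the tensors that have to live on the GPU, independently of the model itself.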

There is more useful information about this error:

OOM when allocating tensor with shape[800000,32,30,62]

This is a weird shape. If you're working with images, you should normally have 1 or 3 channels. Here, 800000 is the number of samples being fed in a single batch, and 30x62 with 32 filters is exactly the output of your first 3x3 convolution on a 32x64 input. On top of that, it seems like you are passing your entire dataset at once; you should instead pass it in batches.

Upvotes: 60

Debayan Mitra

Reputation: 53

Happened to me as well.

You can try reducing the number of trainable parameters by using some form of transfer learning: freeze the initial few layers and use a lower batch size, as in the sketch below.
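
For instance, a minimal sketch, assuming a hypothetical VGG16 backbone (any pre-trained base works the same way):

from tensorflow.keras.applications import VGG16

# Load a pre-trained backbone without its classification head
base = VGG16(weights='imagenet', include_top=False, input_shape=(32, 64, 3))

# Freeze the initial layers so they contribute no trainable parameters
for layer in base.layers[:10]:
    layer.trainable = False

Pair this with a small batch_size in model.fit to keep activation memory down as well.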

Upvotes: 0

Natthaphon Hongcharoen

Reputation: 2430

From [800000,32,30,62] it seems your model is putting all of the data into a single batch.

Try specifying a batch size, like this:

history = model.fit([trainimage, train_product_embd], train_label,
    validation_data=([validimage, valid_product_embd], valid_label),
    epochs=10, steps_per_epoch=100, validation_steps=10, batch_size=32)

If it still OOMs, reduce the batch_size further.

Upvotes: 2
