orome

Reputation: 48476

Bitfusion Ubuntu 14 TensorFlow AMI fails with OOM Errors

Using the "Bitfusion Ubuntu 14 TensorFlow" AMI, any attempt to perform operations with large Tensors, such as

sess.run(tf.argmax(y, 1), feed_dict={x: use_x})

when use_x is a tf.Tensor holding 28,000 rows of floats, results in

"Resource exhausted: OOM"

errors. This renders the AMI unusable for me.

Is there a setting I’m missing to prevent this?

——————————

I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (256):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (512):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (1024):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (2048):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (4096):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (8192):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (16384):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (32768):     Total Chunks: 1, Chunks in use: 0 56.8KiB allocated for chunks. 3.1KiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (65536):     Total Chunks: 1, Chunks in use: 0 111.2KiB allocated for chunks. 4B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (131072):    Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (262144):    Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (524288):    Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (1048576):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (2097152):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (4194304):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (8388608):   Total Chunks: 2, Chunks in use: 0 23.73MiB allocated for chunks. 440.3KiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (16777216):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (33554432):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (67108864):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (134217728):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (268435456):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:656] Bin for 83.74MiB was 64.00MiB, Chunk State: 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a0000 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a0100 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a0200 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a0300 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a0400 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a2400 of size 6144
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a3c00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a3d00 of size 3328
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a4a00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a4b00 of size 204800
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023d6b00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023d6c00 of size 25088000
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x703bc3c00 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x703bc5c00 of size 12000000
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704737700 of size 6144
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704738f00 of size 60160
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704747a00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704747b00 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704749b00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704749c00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704749d00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704749e00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704749f00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70474a000 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70474a100 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70474a200 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704758600 of size 60160
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704767100 of size 76288
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704779b00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704779c00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704779d00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704779e00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704779f00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a000 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a100 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a200 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a300 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a400 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a500 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a600 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a700 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a800 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a900 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477aa00 of size 3328
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477b700 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477b800 of size 204800
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7047ad800 of size 12000000
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x705f67a00 of size 8192
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x705f69a00 of size 25088000
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x707756a00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7082c8600 of size 6144
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7082c9e00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7082c9f00 of size 6144
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7082e7400 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7082e7500 of size 25088000
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x709ad4500 of size 12000000
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70a646000 of size 3328
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70a646d00 of size 204800
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70a678d00 of size 87808000
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70fa36500 of size 3703905024
I tensorflow/core/common_runtime/bfc_allocator.cc:683] Free at 0x70474a300 of size 58112
I tensorflow/core/common_runtime/bfc_allocator.cc:683] Free at 0x70531f300 of size 12879616
I tensorflow/core/common_runtime/bfc_allocator.cc:683] Free at 0x707756b00 of size 12000000
I tensorflow/core/common_runtime/bfc_allocator.cc:683] Free at 0x7082cb700 of size 113920
I tensorflow/core/common_runtime/bfc_allocator.cc:689]      Summary of in-use Chunks by size: 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 35 Chunks of size 256 totalling 8.8KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 3 Chunks of size 3328 totalling 9.8KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 4 Chunks of size 6144 totalling 24.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 4 Chunks of size 8192 totalling 32.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 2 Chunks of size 60160 totalling 117.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 76288 totalling 74.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 3 Chunks of size 204800 totalling 600.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 3 Chunks of size 12000000 totalling 34.33MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 3 Chunks of size 25088000 totalling 71.78MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 87808000 totalling 83.74MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 3703905024 totalling 3.45GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] Sum Total of in-use chunks: 3.64GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:698] Stats: 
Limit:                  3928915968
InUse:                  3903864320
MaxInUse:               3903864320
NumAllocs:                  418794
MaxAllocSize:           3703905024

W tensorflow/core/common_runtime/bfc_allocator.cc:270] ******************************************************************************xxxxxxxxxxxxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 83.74MiB.  See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:907] Resource exhausted: OOM when allocating tensor with shape[28000,1,28,28]

Traceback (most recent call last):
  File "tf_simple.py", line 173, in <module>
    evals = sess.run(tf.argmax(y, 1), feed_dict={x: use_x})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 343, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 567, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 640, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 662, in _do_call
    e.code)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[28000,1,28,28]
     [[Node: 1_conv_layer/kernel_logits/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](as_grid, 1_conv_layer/kernel_weights/W1/read)]]
     [[Node: ArgMax/_2316 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_1481_ArgMax", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'1_conv_layer/kernel_logits/Conv2D', defined at:
  File "tf_simple.py", line 47, in <module>
    final_dropout=final_dropout)
  File "/home/ubuntu/mlcode/tf_utils.py", line 150, in make_ff_network
    layer_name)
  File "/home/ubuntu/mlcode/tf_utils.py", line 86, in _add_conv_layer
    kernel_logits = tf.nn.conv2d(input_tensor, weights, strides=[1, 1, 1, 1], padding='SAME') + biases
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 295, in conv2d
    data_format=data_format, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 694, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2154, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1154, in __init__
    self._traceback = _extract_stack()

Upvotes: 0

Views: 251

Answers (1)

mbajkowski

Reputation: 11

The problem is the roughly 4 GB memory limit on the AWS GPUs, not the AMI:

Limit:                  3928915968
InUse:                  3903864320
MaxInUse:               3903864320
NumAllocs:                  418794
MaxAllocSize:           3703905024

The memory limit is 3.929 GB, 3.904 GB is already in use, and the failed allocation asks for another 83.74 MiB (about 0.088 GB). Only about 25 MB remains free, so the request cannot fit within the limit. On AWS your options are to rewrite your code so that it works within the 4 GB limit, to run that section of code in CPU-only mode and use system memory (which of course defeats the purpose of using a GPU), or to wait for AWS to introduce new GPU instances with more memory.
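The headroom arithmetic can be reproduced from the numbers in the question's log. This is only a rough sketch: the limit and in-use figures are copied from the allocator stats, the shape comes from the OOM message, and 4 bytes per element is assumed because the failing node reports T=DT_FLOAT.

```python
# Figures copied from the allocator stats in the question's log.
limit_bytes = 3928915968
in_use_bytes = 3903864320

# Tensor that failed to allocate: shape [28000, 1, 28, 28].
# Assumption: DT_FLOAT is a 32-bit float, so 4 bytes per element.
shape = (28000, 1, 28, 28)
bytes_per_element = 4

request_bytes = bytes_per_element
for dim in shape:
    request_bytes *= dim

free_bytes = limit_bytes - in_use_bytes
MIB = 1024.0 * 1024.0  # float so the division also works on Python 2

print("request: %d bytes (%.2f MiB)" % (request_bytes, request_bytes / MIB))
print("free:    %d bytes (%.2f MiB)" % (free_bytes, free_bytes / MIB))
print("fits:    %s" % (request_bytes <= free_bytes))
```

The request works out to 87808000 bytes, which matches the 83.74MiB chunk size reported in the log, while the remaining headroom is well below that.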

Alternatively, you could look for another cloud provider such as Nimbix that offers more up-to-date GPUs.
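One possible way to approach the first option, rewriting the code to stay within the GPU memory limit, is to evaluate the examples in smaller batches rather than feeding all 28,000 at once. The sketch below is not taken from the question's code: run_in_batches is a hypothetical helper, the batch size of 1000 in the comment is an arbitrary guess, and it assumes use_x can be sliced like a list or NumPy array. Whether a given batch size actually fits depends on the rest of the model.

```python
def run_in_batches(run_batch, examples, batch_size):
    """Apply run_batch to consecutive slices of examples and collect the results."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    results = []
    for start in range(0, len(examples), batch_size):
        batch = examples[start:start + batch_size]
        results.extend(run_batch(batch))
    return results


# Hypothetical use with the names from the question (not executed here):
#   predict = tf.argmax(y, 1)  # build the op once, outside the loop
#   evals = run_in_batches(
#       lambda batch: sess.run(predict, feed_dict={x: batch}),
#       use_x, batch_size=1000)

# Stand-in demonstration without TensorFlow: take the index of the
# largest value in each row, two rows at a time.
rows = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.5, 0.55]]
argmax_rows = lambda batch: [row.index(max(row)) for row in batch]
print(run_in_batches(argmax_rows, rows, batch_size=2))  # prints [1, 0, 1, 0, 1]
```

Building the argmax op once and reusing it also avoids adding a new node to the graph on every call, which the original one-line sess.run(tf.argmax(y, 1), ...) would do if it were placed inside a loop.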

Upvotes: 1
