Reputation: 306
I am trying to train a slim model using 3 GPUs.
I am specifically telling TF to use the second GPU to allocate the model:
with tf.device('device:GPU:1'):
    logits, end_points = inception_v3(inputs)
However, I'm getting an OOM error on that GPU every time I run my code. I've tried reducing the batch_size so the model fits in memory, but then the net is ruined.
I own 3 GPUs, so is there a way to tell TF to use my third GPU when the second one is full? I've tried not telling TF to use any GPU and allowing soft placement, but it is not working either.
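For reference, the soft-placement attempt was roughly along these lines (a minimal sketch with a dummy op standing in for the real model):

import tensorflow as tf

# Let TF fall back to another device if the requested one is unavailable.
config = tf.ConfigProto(allow_soft_placement=True)

with tf.device('device:GPU:1'):
    x = tf.random_normal([4, 4])
    y = tf.matmul(x, x)

with tf.Session(config=config) as sess:
    print(sess.run(y))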
Upvotes: 1
Views: 508
Reputation: 53758
The statement with tf.device('device:GPU:1') tells TensorFlow to use GPU-1 specifically, so it won't attempt to use any other device you have.
When the model is too big, the recommended approach is model parallelism: manually splitting your graph across different GPUs. The complication in your case is that the model definition lives in the library, so you can't insert tf.device statements for different layers unless you patch tensorflow.
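For a model whose layers you define yourself, such a manual split would look roughly like this (a minimal sketch with made-up layer sizes, not inception_v3 itself):

import tensorflow as tf

# Hypothetical two-layer model: placing the layers by hand keeps only
# part of the variables (and activations) on each device.
inputs = tf.placeholder(tf.float32, [None, 2048])

with tf.device('device:GPU:0'):
    hidden = tf.layers.dense(inputs, 1024, activation=tf.nn.relu)

with tf.device('device:GPU:1'):
    logits = tf.layers.dense(hidden, 1000)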
But there is a workaround: you can define and place the variables yourself before invoking the inception_v3 builder. This way inception_v3 will reuse these variables and not change their placement. Example:
import tensorflow as tf
# inception_v3 is TF-Slim's builder; `inputs` is your input batch.

with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
    # Pre-create the Logits variables on GPU-1 so the builder reuses them there.
    with tf.device('device:GPU:1'):
        tf.get_variable("InceptionV3/Logits/Conv2d_1c_1x1/biases", shape=[1000])
        tf.get_variable("InceptionV3/Logits/Conv2d_1c_1x1/weights", shape=[1, 1, 2048, 1000])
    # All remaining variables are created here and placed on GPU-0.
    with tf.device('device:GPU:0'):
        logits, end_points = inception_v3(inputs)
Upon running, you'll see that all variables except Conv2d_1c_1x1 are placed on GPU-0, while the Conv2d_1c_1x1 layer is on GPU-1.
The drawback is that you need to know the shape of each variable you want to place this way. But it is doable, and it can at least get your model running.
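To verify the placement, you can dump the device of each variable, or turn on device-placement logging (a minimal sketch, assuming the graph above has already been built):

import tensorflow as tf

# Print the device each variable in the graph ended up on.
for var in tf.global_variables():
    print(var.op.name, '->', var.device)

# Or let TF log every placement decision when the session is created.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))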
Upvotes: 1