I'm trying to extend my single-GPU TensorFlow code to multiple GPUs. I have to work over 3 degrees of freedom, and unfortunately I need tf.map_fn to parallelize over the third one. I tried to use device placement as shown in the official documentation, but it looks like it is not possible with tf.map_fn. Is there a way to run tf.map_fn on multiple GPUs?
Here is the error output:
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'map_1/TensorArray_1': Could not satisfy explicit device specification '' because the node was colocated with a group of nodes that required incompatible device '/device:GPU:1'
Colocation Debug Info:
Colocation group had the following types and devices:
TensorArrayGatherV3: GPU CPU
Range: GPU CPU
TensorArrayWriteV3: GPU CPU
TensorArraySizeV3: GPU CPU
MatMul: GPU CPU
Enter: GPU CPU
TensorArrayV3: GPU CPU
Const: GPU CPU
Colocation members and user-requested devices:
map_1/TensorArrayStack/range/delta (Const)
map_1/TensorArrayStack/range/start (Const)
map_1/TensorArray_1 (TensorArrayV3)
map_1/while/TensorArrayWrite/TensorArrayWriteV3/Enter (Enter) /device:GPU:1
map_1/TensorArrayStack/TensorArraySizeV3 (TensorArraySizeV3)
map_1/TensorArrayStack/range (Range)
map_1/TensorArrayStack/TensorArrayGatherV3 (TensorArrayGatherV3)
map_1/while/MatMul (MatMul) /device:GPU:1
map_1/while/TensorArrayWrite/TensorArrayWriteV3 (TensorArrayWriteV3) /device:GPU:1
[[Node: map_1/TensorArray_1 = TensorArrayV3[clear_after_read=true, dtype=DT_FLOAT, dynamic_size=false, element_shape=<unknown>, identical_element_shapes=true, tensor_array_name=""](map_1/TensorArray_1/size)]]
Here is a simple code example to reproduce it:
import tensorflow as tf
import numpy

rc = 1000
sess = tf.Session()

for deviceName in ['/cpu:0', '/device:GPU:0', '/device:GPU:1']:
    with tf.device(deviceName):
        matrices = tf.random_uniform([rc, rc, 4], minval=0, maxval=1, dtype=tf.float32)

        def mult(i):
            product = tf.matmul(matrices[:, :, i], matrices[:, :, i + 1])
            return product

        mul = tf.zeros([rc, rc, 3], dtype=tf.float32)
        # map over the third index; this is where the explicit device placement fails
        mul = tf.map_fn(mult, numpy.array([0, 1, 2]), dtype=tf.float32, parallel_iterations=10)

m = sess.run(mul)
What you are trying to do can be accomplished with a single batched matmul. Consider the following changes:
import tensorflow as tf
import numpy
import time
import numpy as np

rc = 1000
sess = tf.Session()

# compute on cpu for comparison later
vals = np.random.uniform(size=[rc, rc, 4]).astype(np.float32)
mat1 = tf.identity(vals)
mat2 = tf.transpose(vals, [2, 0, 1])

# store mul in array so all are fetched in run call
muls = []

# I only have one GPU.
for deviceName in ['/cpu:0', '/device:GPU:0']:
    with tf.device(deviceName):
        def mult(i):
            product = tf.matmul(mat1[:, :, i], mat1[:, :, i + 1])
            return product

        mul = tf.zeros([rc, rc, 3], dtype=tf.float32)
        mul = tf.map_fn(mult, numpy.array([0, 1, 2]), dtype=tf.float32, parallel_iterations=10)
        muls.append(mul)

# use transposed mat with a shift to matmul in one go
mul = tf.matmul(mat2[:-1], mat2[1:])

print(muls)
print(mul)

start = time.time()
m1 = sess.run(muls)
end = time.time()
print("muls:", end - start)

start = time.time()
m2 = sess.run(mul)
end = time.time()
print("mul:", end - start)

print(np.allclose(m1[0], m1[1]))
print(np.allclose(m1[0], m2))
print(np.allclose(m1[1], m2))
The results on my PC are:
[<tf.Tensor 'map/TensorArrayStack/TensorArrayGatherV3:0' shape=(3, 1000, 1000) dtype=float32>, <tf.Tensor 'map_1/TensorArrayStack/TensorArrayGatherV3:0' shape=(3, 1000, 1000) dtype=float32>]
Tensor("MatMul:0", shape=(3, 1000, 1000), dtype=float32)
muls: 0.4262731075286865
mul: 0.3794088363647461
True
True
True
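If the three products really do need to be spread over more than one GPU, one option is to drop tf.map_fn (whose TensorArray ops are what end up in the colocation group in your error) and pin each tf.matmul to a device explicitly, then stack the results. This is only a minimal sketch on my part, and the device list is an assumption about the hardware you have available:

import numpy as np
import tensorflow as tf

rc = 1000
vals = np.random.uniform(size=[rc, rc, 4]).astype(np.float32)
mat = tf.constant(vals)

# one device per product; adjust this list to the GPUs you actually have
devices = ['/device:GPU:0', '/device:GPU:1', '/device:GPU:0']
products = []
for i, dev in enumerate(devices):
    with tf.device(dev):
        # each slice product is pinned to its own device explicitly
        products.append(tf.matmul(mat[:, :, i], mat[:, :, i + 1]))

stacked = tf.stack(products)  # shape (3, rc, rc), same as the map_fn result

sess = tf.Session()
result = sess.run(stacked)

Because the three matmuls are independent, TensorFlow can launch them on their devices concurrently within a single sess.run call.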
You rarely want to use the CPU synchronously with the GPU, as it is going to be the bottleneck: the GPUs will be waiting for the CPU to finish. If you do anything on the CPU, it should run asynchronously with the GPUs so they can run at full tilt.
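On that last point, one common pattern is to push CPU-side work into an input pipeline with prefetch, so the CPU prepares the next element while the GPU is busy. This is a sketch only, assuming TF 1.x with the tf.data API; cpu_preprocess, the shapes, and the device name are made-up placeholders:

import numpy as np
import tensorflow as tf

rc = 1000
# a small batch of independent problems; shapes are just for illustration
vals = np.random.uniform(size=[10, rc, rc, 4]).astype(np.float32)

def cpu_preprocess(x):
    # stands in for whatever CPU-side work you need; runs inside the input pipeline
    return tf.transpose(x, [2, 0, 1])

dataset = (tf.data.Dataset.from_tensor_slices(vals)
           .map(cpu_preprocess, num_parallel_calls=2)
           .prefetch(2))  # the CPU prepares upcoming elements while the GPU computes
mat2 = dataset.make_one_shot_iterator().get_next()

with tf.device('/device:GPU:0'):  # device name is an assumption
    batched_mul = tf.matmul(mat2[:-1], mat2[1:])  # batched matmul on the GPU

sess = tf.Session()
for _ in range(10):
    sess.run(batched_mul)

Here prefetch(2) lets the pipeline run up to two elements ahead, so the CPU transpose overlaps with the GPU matmul instead of serializing with it.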