LolAsdOmgWtfAfk

Reputation: 109

Using tf.map_fn with multiple GPUs

I'm trying to extend my single-GPU TensorFlow code to multiple GPUs. My problem has 3 degrees of freedom, and unfortunately I need tf.map_fn to parallelize over the third one. I tried device placement as shown in the official documentation, but it does not seem to work with tf.map_fn. Is there a way to run tf.map_fn on multiple GPUs?

Here is the error output:

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'map_1/TensorArray_1': Could not satisfy explicit device specification '' because the node was colocated with a group of nodes that required incompatible device '/device:GPU:1'
Colocation Debug Info:
Colocation group had the following types and devices: 
TensorArrayGatherV3: GPU CPU 
Range: GPU CPU 
TensorArrayWriteV3: GPU CPU 
TensorArraySizeV3: GPU CPU 
MatMul: GPU CPU 
Enter: GPU CPU 
TensorArrayV3: GPU CPU 
Const: GPU CPU 

Colocation members and user-requested devices:
  map_1/TensorArrayStack/range/delta (Const) 
  map_1/TensorArrayStack/range/start (Const) 
  map_1/TensorArray_1 (TensorArrayV3) 
  map_1/while/TensorArrayWrite/TensorArrayWriteV3/Enter (Enter) /device:GPU:1
  map_1/TensorArrayStack/TensorArraySizeV3 (TensorArraySizeV3) 
  map_1/TensorArrayStack/range (Range) 
  map_1/TensorArrayStack/TensorArrayGatherV3 (TensorArrayGatherV3) 
  map_1/while/MatMul (MatMul) /device:GPU:1
  map_1/while/TensorArrayWrite/TensorArrayWriteV3 (TensorArrayWriteV3) /device:GPU:1

         [[Node: map_1/TensorArray_1 = TensorArrayV3[clear_after_read=true, dtype=DT_FLOAT, dynamic_size=false, element_shape=<unknown>, identical_element_shapes=true, tensor_array_name=""](map_1/TensorArray_1/size)]]

Here is a simple code example that reproduces it:

import tensorflow as tf
import numpy

rc = 1000

sess = tf.Session()

# build the same map_fn graph under each device; the GPU placements trigger the error above
for deviceName in ['/cpu:0', '/device:GPU:0', '/device:GPU:1']:
    with tf.device(deviceName):
        matrices = tf.random_uniform([rc, rc, 4], minval=0, maxval=1, dtype=tf.float32)

        def mult(i):
            product = tf.matmul(matrices[:, :, i], matrices[:, :, i + 1])
            return product

        mul = tf.zeros([rc, rc, 3], dtype=tf.float32)
        mul = tf.map_fn(mult, numpy.array([0, 1, 2]), dtype=tf.float32, parallel_iterations=10)

m = sess.run(mul)
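
A possibly relevant setting, which I have not verified, is soft placement: creating the session with allow_soft_placement=True is supposed to let TensorFlow fall back to a compatible device instead of raising the colocation error, though the map_fn body may then not actually run on the GPU I asked for:

import tensorflow as tf

# unverified: allow_soft_placement lets TensorFlow move ops it cannot place
# (e.g. map_fn's internal TensorArray ops) to a compatible device instead of failing;
# log_device_placement prints where each op actually ends up
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
sess = tf.Session(config=config)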


Upvotes: 1

Views: 419

Answers (1)

McAngus

Reputation: 1856

What you are trying to do can be accomplished with a batched matmul: tf.matmul treats the leading dimension of a 3-D tensor as a batch dimension and multiplies the slices pairwise. Consider the following changes:

import tensorflow as tf
import numpy as np
import time

rc = 1000

sess = tf.Session()

# compute the inputs on the CPU (NumPy) so the variants can be compared later
vals = np.random.uniform(size=[rc, rc, 4]).astype(np.float32)
mat1 = tf.identity(vals)               # shape [rc, rc, 4]
mat2 = tf.transpose(vals, [2, 0, 1])   # shape [4, rc, rc]

# store each map_fn result in a list so all are fetched in one run call
muls = []
# I only have one GPU.
for deviceName in ['/cpu:0', '/device:GPU:0']:
    with tf.device(deviceName):

        def mult(i):
            product = tf.matmul(mat1[:, :, i], mat1[:, :, i + 1])
            return product

        mul = tf.zeros([rc, rc, 3], dtype=tf.float32)
    # note: map_fn itself is built here, outside the tf.device block
    mul = tf.map_fn(mult, np.array([0, 1, 2]), dtype=tf.float32, parallel_iterations=10)
    muls.append(mul)

# use the transposed tensor with a shift to do all three matmuls in one batched call
mul = tf.matmul(mat2[:-1], mat2[1:])

print(muls)
print(mul)

start = time.time()
m1 = sess.run(muls)
end = time.time()

print("muls:", end - start)

start = time.time()
m2 = sess.run(mul)
end = time.time()

print("mul:", end - start)

# all three variants should produce the same products
print(np.allclose(m1[0], m1[1]))
print(np.allclose(m1[0], m2))
print(np.allclose(m1[1], m2))

The results on my PC are:

[<tf.Tensor 'map/TensorArrayStack/TensorArrayGatherV3:0' shape=(3, 1000, 1000) dtype=float32>, <tf.Tensor 'map_1/TensorArrayStack/TensorArrayGatherV3:0' shape=(3, 1000, 1000) dtype=float32>]
Tensor("MatMul:0", shape=(3, 1000, 1000), dtype=float32)
muls: 0.4262731075286865
mul: 0.3794088363647461
True
True
True

You rarely want to use the CPU synchronously with the GPUs, because it becomes the bottleneck: the GPUs end up waiting for the CPU to finish. Any CPU work should run asynchronously with respect to the GPUs so they can run at full tilt.
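
To come back to the multi-GPU part of the question: since the work is just three independent matrix products, one option is to split the batched matmul across the devices explicitly instead of going through tf.map_fn. This is a minimal sketch, assuming two GPUs '/device:GPU:0' and '/device:GPU:1' are available (the device names and the 2/1 split are mine, not from the question):

import numpy as np
import tensorflow as tf

rc = 1000

# same input layout as above: [4, rc, rc] so adjacent slices can be multiplied pairwise
vals = np.random.uniform(size=[rc, rc, 4]).astype(np.float32)
mat = tf.constant(np.transpose(vals, [2, 0, 1]))

parts = []
# give each GPU a share of the shifted pairs; each tf.matmul is an ordinary batched matmul
with tf.device('/device:GPU:0'):
    parts.append(tf.matmul(mat[0:2], mat[1:3]))   # products 0*1 and 1*2
with tf.device('/device:GPU:1'):
    parts.append(tf.matmul(mat[2:3], mat[3:4]))   # product 2*3
result = tf.concat(parts, axis=0)                 # shape [3, rc, rc]

# soft placement so the graph still runs if one of the GPUs is missing
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
print(sess.run(result).shape)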

Upvotes: 1
