Minimal example to demonstrate the problem:
import tensorflow as tf

with tf.distribute.MirroredStrategy().scope():
    print(tf.Variable(1.))
Output on a 4-GPU server:
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
MirroredVariable:{
0: <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>,
1: <tf.Variable 'Variable/replica_1:0' shape=() dtype=float32, numpy=0.0>,
2: <tf.Variable 'Variable/replica_2:0' shape=() dtype=float32, numpy=0.0>,
3: <tf.Variable 'Variable/replica_3:0' shape=() dtype=float32, numpy=0.0>
}
The problem, as seen above, is that the replicas do not hold the correct variable value: every device except the first shows zero (the numpy=0.0 parts). The same happens with 2 or 3 visible devices, not just with all 4.
The same code produces the expected behavior on a different, 2-GPU workstation:
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
MirroredVariable:{
0: <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>,
1: <tf.Variable 'Variable/replica_1:0' shape=() dtype=float32, numpy=1.0>
}
(Note the value 1.0 on both devices)
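For scripted checks, you can compare the per-replica components directly: a MirroredVariable exposes them through its `.values` property, and each component's `.numpy()` should be identical. A minimal helper for that comparison (pure Python, operating on already-extracted numpy values, so it runs without GPUs):

```python
def replicas_consistent(per_replica_values):
    """Return True if every replica holds the same value as replica 0."""
    first = per_replica_values[0]
    return all(v == first for v in per_replica_values[1:])

# On a real MirroredVariable `var`, you would call:
#   replicas_consistent([v.numpy() for v in var.values])

# Values from the broken 4-GPU machine above:
print(replicas_consistent([1.0, 0.0, 0.0, 0.0]))  # False

# Values from the healthy 2-GPU workstation:
print(replicas_consistent([1.0, 1.0]))  # True
```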
The problematic machine is a Dell PowerEdge R750xa with 4x Nvidia A40 GPUs.
The correctly working machine has 2x Titan RTX.
Software config on both:
What could be the reason for such behavior? Glad to provide more details.
This is not an answer, but I cannot comment, so I am adding it here in case it helps. I have a similar issue and posted a question a while ago that is still unanswered: MirroredStrategy output varies depending on the visible GPUs
As a workaround for now, you can run nvidia-smi topo -m and inspect the interconnect topology between the GPUs.
In my case, your example works fine with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 but fails with CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7,0.
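If reordering the visible devices helps on your machine too, keep in mind that CUDA_VISIBLE_DEVICES must be set before TensorFlow initializes the CUDA runtime, either in the shell or at the very top of the script. A minimal sketch (the device list 0,1,2,3,4,5,6,7 is just an example for an 8-GPU box):

```python
import os

# Must run before `import tensorflow`: TensorFlow reads this variable
# once, when it first initializes CUDA, and ignores later changes.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"

# import tensorflow as tf  # import only after the variable is set
print(os.environ["CUDA_VISIBLE_DEVICES"])  # 0,1,2,3,4,5,6,7
```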